
How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not

by Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane

First submitted to arXiv on: 25 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (original content by GrooveSquid.com)
The paper investigates how Large Language Models (LLMs) can be leveraged for speech-to-text (S2T) tasks. It proposes projecting the output of Speech Foundation Models (SFMs) into the LLM embedding space with an adapter module, and evaluates five adapter architectures, two LLMs (Mistral and Llama), and two SFMs (Whisper and SeamlessM4T) on Automatic Speech Recognition and Speech Translation tasks. Results show that the choice of SFM has a crucial impact on downstream performance, while the choice of adapter has only a moderate effect that depends on the SFM/LLM combination.
Low Difficulty Summary (original content by GrooveSquid.com)
The paper looks at how to use big language models for speech recognition. It tries different ways of combining the output of speech models with language models. The study shows which combinations work best for recognizing words from audio recordings and for translating speech into text. The results show that the quality of the speech model matters most, while the choice of adapter module has a smaller effect that depends on the models used.
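To make the core idea concrete, here is a minimal sketch of an adapter that projects SFM output frames into an LLM's embedding space. All names, dimensions, and the frame-stacking strategy are illustrative assumptions, not the paper's actual implementation; the paper compares five different adapter designs, of which a linear projection over stacked frames is only one simple possibility.

```python
# Hypothetical sketch: project speech-foundation-model (SFM) features into
# an LLM embedding space. Dimensions and the stacking scheme are assumptions
# for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Stacks consecutive SFM frames to shorten the long speech sequence,
    then linearly projects each stacked frame into the LLM embedding dim."""

    def __init__(self, sfm_dim: int, llm_dim: int, stack: int = 4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(sfm_dim * stack, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, sfm_dim)
        b, t, d = feats.shape
        t = (t // self.stack) * self.stack            # drop trailing frames
        feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(feats)                       # (batch, t/stack, llm_dim)

# Example with Whisper-like feature size (1280) and a Llama-like
# embedding size (4096) -- both assumed for illustration.
adapter = LinearAdapter(sfm_dim=1280, llm_dim=4096, stack=4)
speech_feats = torch.randn(2, 100, 1280)   # 2 utterances, 100 frames each
llm_inputs = adapter(speech_feats)
print(llm_inputs.shape)                    # torch.Size([2, 25, 4096])
```

The projected sequence can then be concatenated with the LLM's text-token embeddings, which is the general connection pattern the paper studies across its adapter variants.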

Keywords

» Artificial intelligence  » Embedding space  » Llama  » Translation