
How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not

by Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane

First submitted to arXiv on: 25 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (original content by GrooveSquid.com)
The paper investigates how Large Language Models (LLMs) can be leveraged for speech-to-text (S2T) tasks. It proposes projecting the output of Speech Foundation Models (SFMs) into the LLM embedding space with an adapter module, and evaluates five adapter architectures, two LLMs (Mistral and Llama), and two SFMs (Whisper and SeamlessM4T) on Automatic Speech Recognition and Speech Translation tasks. Results show that the choice of SFM has a crucial impact on downstream performance, while the choice of adapter has only a moderate effect that depends on the SFM/LLM combination.
Low Difficulty Summary (original content by GrooveSquid.com)
The paper looks at how to use big language models for speech recognition. It tries different ways of combining the output of speech models with language models. The study shows which combinations work best for recognizing words from audio recordings and for translating speech into text. The results show that the quality of the speech model matters most, while the choice of adapter module has a smaller effect that depends on the models used.
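To make the core idea concrete, here is a minimal sketch of an adapter that projects SFM output frames into an LLM's embedding space. All names, dimensions, and the frame-stacking strategy are illustrative assumptions, not the paper's actual implementation; the paper compares five different adapter designs, of which a linear projection over stacked frames is only one simple possibility.

```python
# Hypothetical sketch: project speech-foundation-model (SFM) features into
# an LLM embedding space. Dimensions and the stacking scheme are assumptions
# for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Stacks consecutive SFM frames to shorten the long speech sequence,
    then linearly projects each stacked frame into the LLM embedding dim."""

    def __init__(self, sfm_dim: int, llm_dim: int, stack: int = 4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(sfm_dim * stack, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, sfm_dim)
        b, t, d = feats.shape
        t = (t // self.stack) * self.stack            # drop trailing frames
        feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(feats)                       # (batch, t/stack, llm_dim)

# Example with Whisper-like feature size (1280) and a Llama-like
# embedding size (4096) -- both assumed for illustration.
adapter = LinearAdapter(sfm_dim=1280, llm_dim=4096, stack=4)
speech_feats = torch.randn(2, 100, 1280)   # 2 utterances, 100 frames each
llm_inputs = adapter(speech_feats)
print(llm_inputs.shape)                    # torch.Size([2, 25, 4096])
```

The projected sequence can then be concatenated with the LLM's text-token embeddings, which is the general connection pattern the paper studies across its adapter variants.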

Keywords

» Artificial intelligence  » Embedding space  » Llama  » Translation