

SALSA: Speedy ASR-LLM Synchronous Aggregation

by Ashish Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi

First submitted to arXiv on: 29 Aug 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Medium Difficulty Summary (original content by GrooveSquid.com)
Harnessing pre-trained large language models (LLMs) has emerged as a key area for improving automatic speech recognition (ASR) systems, particularly for low-resource languages. Current methods either rely on LLMs for ASR error correction or tightly couple the two systems to replace the ASR decoder with the LLM. However, these approaches often increase decoding time or require expensive training of cross-attention layers. To address this, we propose SALSA, a novel approach that couples the decoder layers of the ASR system to those of the LLM, while synchronously advancing both decoders. This is achieved through a simple projection of the last decoder state, making it significantly more training efficient than earlier approaches. The proposed coupling also requires handling tokenization mismatches between the LLM and ASR systems, which we address using cascading tokenization with respect to the LLM and ASR vocabularies. Our evaluation on 8 low-resource languages in the FLEURS benchmark demonstrates substantial word error rate (WER) reductions of up to 38%.
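The coupling described in the abstract, a projection of the ASR decoder's last hidden state into the LLM decoder's representation space, can be sketched roughly as below. The hidden sizes, the random initialization, the additive fusion, and all function names here are illustrative assumptions for the sketch, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden dimensions for the two decoders
D_ASR, D_LLM = 256, 1024

# Learned projection matrix (randomly initialized here for illustration;
# in practice this would be the only newly trained coupling parameter)
W_proj = rng.normal(scale=0.02, size=(D_ASR, D_LLM))

def couple_states(asr_state: np.ndarray, llm_state: np.ndarray) -> np.ndarray:
    """Project the ASR decoder's last hidden state into the LLM's
    hidden dimension and fuse it with the LLM state.
    (Additive fusion is an assumption made for this sketch.)"""
    return llm_state + asr_state @ W_proj

# One synchronous decoding step: both decoders have advanced,
# and the ASR state is injected into the LLM's computation.
asr_state = rng.normal(size=(D_ASR,))
llm_state = rng.normal(size=(D_LLM,))
fused = couple_states(asr_state, llm_state)
print(fused.shape)  # (1024,)
```

Because only the projection is new, training touches far fewer parameters than learning full cross-attention layers between the two models, which is the efficiency argument the abstract makes.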
Low Difficulty Summary (original content by GrooveSquid.com)
Scientists are working on ways to improve speech recognition technology, especially for languages that don’t have much data available. They’re doing this with large language models, powerful tools that can help correct mistakes made by speech recognition systems. The problem is that existing methods often require a lot of extra training or slow down decoding; the new approach, called SALSA, avoids this. It connects the two systems so they advance in step, which makes training much more efficient. The researchers also had to make sure the language model and the speech recognition system split words into pieces in a compatible way. The results show that this new approach can reduce errors by up to 38% in languages with limited data.

Keywords

» Artificial intelligence  » Cross attention  » Decoder  » Language model  » Tokenization