
Summary of It’s Never Too Late: Fusing Acoustic Information Into Large Language Models For Automatic Speech Recognition, by Chen Chen et al.


It’s Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

by Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

First submitted to arXiv on: 8 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty | Written by | Summary
High | Paper authors | High Difficulty Summary: Read the original abstract here
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: Recent research has shown that large language models (LLMs) can perform generative error correction (GER) on the output of automatic speech recognition (ASR) systems. However, because the LLM is trained without considering the acoustic information available in the speech signal, current approaches introduce extra data uncertainty. This paper proposes a late fusion solution, Uncertainty-Aware Dynamic Fusion (UADF), to overcome this limitation. UADF is a multimodal fusion approach built into the auto-regressive decoding process. It works in two stages: (1) analyzing and calibrating the LLM's token-level decisions, and (2) dynamically assimilating acoustic information. Experimental results on various ASR tasks show that UADF outperforms existing fusion mechanisms, significantly improving word error rate (WER) while mitigating data uncertainty. UADF also generalizes well and adapts to audio-visual speech recognition.
Low | GrooveSquid.com (original content) | Low Difficulty Summary: This paper improves automatic speech recognition (ASR), a technology that uses computers to understand what people are saying. Current methods that use a language model to correct ASR output have a problem: the language model does not take into account the sounds that make up the words. The researchers came up with a new method, called Uncertainty-Aware Dynamic Fusion (UADF), which feeds that sound information back in while the computer decides on each word. They tested UADF on several speech recognition tasks and found that it makes fewer word errors than existing methods. It also generalizes well and can be adapted to audio-visual speech recognition.
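
To make the two-stage idea from the medium difficulty summary more concrete, here is a minimal Python sketch of uncertainty-aware late fusion at a single decoding step. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how calibration or uncertainty is computed, so the sketch assumes temperature scaling for calibration and normalized entropy as the uncertainty measure. The function names (`softmax`, `normalized_entropy`, `fuse_step`) and the parameters `temperature` and `max_weight` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Entropy divided by its maximum, so the result lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def fuse_step(llm_logits, asr_logits, temperature=1.5, max_weight=0.5):
    """One decoding step of a hypothetical uncertainty-aware late fusion.

    Stage 1 (assumed): calibrate the LLM's token distribution with
    temperature scaling.
    Stage 2 (assumed): mix in the acoustic model's distribution with a
    weight that grows with the LLM's uncertainty.
    """
    llm_probs = softmax(llm_logits, temperature)
    asr_probs = softmax(asr_logits)
    weight = max_weight * normalized_entropy(llm_probs)
    fused = [(1.0 - weight) * p + weight * q
             for p, q in zip(llm_probs, asr_probs)]
    return fused, weight


# Toy three-token vocabulary. The acoustic model prefers token 1 in both cases.
asr = [0.0, 3.0, 0.0]

# Confident LLM: the fusion weight stays small and the LLM's token 0 is kept.
fused, weight = fuse_step([8.0, 0.0, 0.0], asr)
print(round(weight, 3), fused.index(max(fused)))

# Uncertain LLM: the fusion weight grows and the acoustic token 1 wins.
fused, weight = fuse_step([1.0, 0.9, 0.0], asr)
print(round(weight, 3), fused.index(max(fused)))
```

In this sketch the acoustic evidence only changes the decision when the language model is unsure, which is the intuition behind fusing late and dynamically rather than with a fixed weight. A real system would apply such a step inside the auto-regressive decoding loop over the full vocabulary.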

Keywords

» Artificial intelligence  » Generalization  » Token