Summary of An Embarrassingly Simple Approach for LLM with Strong ASR Capacity, by Ziyang Ma et al.


An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

by Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen

First submitted to arXiv on: 13 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary
Written by: Paper authors
Read the original abstract here

Medium Difficulty Summary
Written by: GrooveSquid.com (original content)
This paper tackles automatic speech recognition (ASR) by combining three components: an off-the-shelf speech encoder, a large language model (LLM), and a linear projector that connects them. Surprisingly, this simple composition achieves state-of-the-art performance on the LibriSpeech benchmark, outperforming previous LLM-based ASR models. The proposed system, SLAM-ASR, requires minimal task-specific design, and only the linear projector is trained. The authors explore several combinations of LLMs and speech encoders to demonstrate the approach's effectiveness, and they investigate how modal alignment capability emerges in LLM-based ASR systems.

Low Difficulty Summary
Written by: GrooveSquid.com (original content)
This paper improves automatic speech recognition by combining existing parts: a speech encoder, a large language model, and a small connecting layer. It shows that this simple recipe works really well on a well-known benchmark called LibriSpeech. The authors call their system SLAM-ASR. It is easy to set up because it needs little task-specific design and only the small connecting layer is trained. They also tried different combinations of speech encoders and language models, and studied how the system learns to match speech with text.
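
To make the composition described in the summaries more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the dimensions, the random stand-in for the speech encoder, the placeholder prompt embeddings, and the order in which speech and prompt embeddings are concatenated are all assumptions made for illustration. The only point it tries to show is that a single linear map converts encoder frames to the LLM's embedding size, and that this map holds the only trainable parameters.

```python
import random

# Hypothetical dimensions, chosen only for illustration.
ENCODER_DIM = 4   # size of each speech-encoder output frame
LLM_DIM = 6       # size of each LLM input embedding


def frozen_speech_encoder(num_frames, seed=0):
    """Stand-in for a pretrained, untrained-here speech encoder.

    Returns num_frames feature vectors of size ENCODER_DIM.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(ENCODER_DIM)]
            for _ in range(num_frames)]


class LinearProjector:
    """The only trainable component in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=1):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, frames):
        # Apply the same linear map to every frame independently.
        return [
            [sum(w * x for w, x in zip(row, frame)) + b
             for row, b in zip(self.weight, self.bias)]
            for frame in frames
        ]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def build_llm_inputs(speech_embeddings, prompt_embeddings):
    """Join projected speech and prompt embeddings into one input sequence.

    The ordering (speech first, then prompt) is an assumption of this sketch.
    """
    return speech_embeddings + prompt_embeddings


frames = frozen_speech_encoder(num_frames=5)
projector = LinearProjector(ENCODER_DIM, LLM_DIM)
speech_embeddings = projector(frames)

# Placeholder for the embeddings of a text prompt; a real LLM would supply these.
prompt_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

llm_inputs = build_llm_inputs(speech_embeddings, prompt_embeddings)

print(len(llm_inputs), len(llm_inputs[0]))  # sequence length, embedding size
print(projector.num_parameters())           # trainable parameters in the sketch
```

In a real system the resulting sequence would be passed to the LLM as input embeddings, and training would update only the projector's weight and bias while the encoder and the LLM stay fixed.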

Keywords

» Artificial intelligence  » Alignment