Summary of An Embarrassingly Simple Approach for LLM with Strong ASR Capacity, by Ziyang Ma et al.
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
by Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
First submitted to arXiv on: 13 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper tackles automatic speech recognition (ASR) by combining an off-the-shelf speech encoder, a large language model (LLM), and a linear projector. Surprisingly, this simple composition achieves state-of-the-art performance on the LibriSpeech benchmark, outperforming previous LLM-based ASR models. The proposed SLAM-ASR system requires minimal task-specific design, and only the linear projector is trained. By exploring various combinations of LLMs and speech encoders, the authors demonstrate the effectiveness of this approach. They also investigate how modal alignment capability emerges in LLM-based ASR systems. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper makes automatic speech recognition better by connecting existing parts together. It shows that this simple approach works really well on a well-known benchmark called LibriSpeech. The authors created a system called SLAM-ASR, which is easy to set up because only one small connecting piece needs to be trained. They also tried different combinations of speech encoders and language models to see which ones work best. |
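
To make the "speech encoder, linear projector, LLM" composition concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the dimensions, the random stand-in encoder, the placeholder prompt embeddings, and the speech-then-prompt ordering are all illustrative assumptions, and no real speech encoder or LLM is involved. It only shows how a linear projector maps encoder outputs into the LLM's embedding width, and that the projector is the only component holding trainable parameters.

```python
import random

# All sizes are illustrative assumptions, not values from the paper.
ENCODER_DIM = 4  # width of the frozen speech encoder's output vectors
LLM_DIM = 6      # width of the frozen LLM's input embeddings


def frozen_speech_encoder(num_frames, rng):
    """Hypothetical stand-in for an off-the-shelf encoder: one vector per frame."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(ENCODER_DIM)]
            for _ in range(num_frames)]


class LinearProjector:
    """The only trainable part in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, rng):
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, frames):
        # Project every frame from encoder space into LLM embedding space.
        return [[sum(w * x for w, x in zip(row, frame)) + b
                 for row, b in zip(self.weight, self.bias)]
                for frame in frames]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def build_llm_input(speech_embeddings, prompt_embeddings):
    """Concatenate along the sequence axis (ordering is an assumption here)."""
    return speech_embeddings + prompt_embeddings


rng = random.Random(0)
frames = frozen_speech_encoder(num_frames=5, rng=rng)
projector = LinearProjector(ENCODER_DIM, LLM_DIM, rng)
speech_embeddings = projector(frames)

# Placeholder for the embedded text prompt; a real LLM would supply these.
prompt_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

llm_input = build_llm_input(speech_embeddings, prompt_embeddings)
print(len(llm_input), len(llm_input[0]))  # sequence length, embedding width
print(projector.num_parameters())         # trainable parameters in this toy
```

In a real system the encoder and LLM would be pretrained models kept frozen, and only the projector's weights would receive gradient updates during training.
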
Keywords
» Artificial intelligence » Alignment