Summary of WavLLM: Towards Robust and Adaptive Speech Large Language Model, by Shujie Hu et al.
WavLLM: Towards Robust and Adaptive Speech Large Language Model
by Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei
First submitted to arXiv on: 31 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper introduces WavLLM, a robust and adaptive speech large language model that can process both semantic content and speaker identity. The model features dual encoders and a prompt-aware LoRA weight adapter, optimized through a two-stage curriculum learning approach: WavLLM first learns foundational capabilities on mixed elementary single tasks before moving on to advanced multi-task training. The model achieves state-of-the-art performance across a range of speech tasks, including automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), and emotion recognition (ER), and demonstrates robust generalization in executing complex tasks using a chain-of-thought (CoT) approach. |
| Low | GrooveSquid.com (original content) | WavLLM is a new kind of computer program that can understand and process spoken language. It’s like having a super smart ear! The researchers developed WavLLM to be really good at understanding different types of speech, including what people are saying and who they are. They used a special way of training the model called curriculum learning, which helped it learn lots of different tasks step by step. This means WavLLM can understand and respond to spoken language in many different ways, like answering questions or completing tasks. |
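To make the "prompt-aware LoRA weight adapter" idea more concrete, here is a minimal numpy sketch. It is not the paper's implementation: the gating function, its shapes, and names like `w_gate` and `prompt_aware_lora` are assumptions for illustration. The core idea shown is that a frozen base weight `W` receives a low-rank update `B @ A` whose strength is conditioned on a prompt embedding, so the same backbone can adapt its behavior per instruction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, d_prompt = 16, 4, 8

# Frozen base weight of one linear layer in the backbone.
W = rng.standard_normal((d_model, d_model))

# LoRA factors: the update B @ A has rank << d_model.
A = rng.standard_normal((rank, d_model)) * 0.01
B = np.zeros((d_model, rank))  # zero-init, so training starts from the base model

# Hypothetical adapter predictor: maps a prompt embedding to a scalar gate
# that scales the LoRA update for this prompt (an assumption, not the paper's exact design).
w_gate = rng.standard_normal(d_prompt)

def prompt_aware_lora(x, prompt_emb):
    gate = 1.0 / (1.0 + np.exp(-w_gate @ prompt_emb))  # sigmoid gate in (0, 1)
    W_eff = W + gate * (B @ A)                          # prompt-conditioned effective weight
    return x @ W_eff

x = rng.standard_normal((2, d_model))  # batch of 2 hidden states
p = rng.standard_normal(d_prompt)      # prompt embedding
y = prompt_aware_lora(x, p)
print(y.shape)  # (2, 16)
```

Because `B` is zero-initialized, the adapted layer initially behaves exactly like the frozen base layer; during fine-tuning, only the low-rank factors and the gate predictor would be trained, which is what makes LoRA-style adaptation cheap.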
Keywords
* Artificial intelligence * Curriculum learning * Generalization * Large language model * LoRA * Multi-task * Prompt