


Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

by Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

First submitted to arXiv on: 17 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors): the paper's original abstract.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper investigates the performance of audio language models in an underserved language, using Thai as an example. It reveals that, despite being built on multilingual backbones, these models' abilities do not transfer cross-lingually to low-resource languages. To address this limitation, the paper proposes a data-mixture approach for developing audio language models optimized for both a target language and English. The authors also integrate audio comprehension and speech instruction-following capabilities into a single unified model. Experimental results demonstrate that their proposed model, Typhoon-Audio, outperforms existing open-source audio language models by a significant margin and is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.

Low Difficulty Summary (original content by GrooveSquid.com)
Audio language models can understand audio inputs and perform tasks like speech recognition and captioning based on text prompts. These models are usually built from pre-trained audio encoders and large language models (LLMs). However, current models are mostly trained on English data, which limits them to English instructions or speech inputs. This paper looks at how well existing audio language models work in Thai, a less common language. It finds that these models cannot understand languages other than English. To solve this problem, the authors mix different types of data together to create a new model that understands both Thai and English. They also combine two important capabilities, understanding audio and following spoken instructions, into one model. The results show that their new model, Typhoon-Audio, is much better at these tasks than existing models.

Keywords

  • Artificial intelligence
  • Gemini