Summary of A Multimodal Approach to Device-directed Speech Detection with Large Language Models, by Dominik Wagner et al.
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
by Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
First submitted to arXiv on: 21 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper investigates ways to make interactions with virtual assistants more intuitive by removing the need for a trigger phrase. The researchers explore three approaches: acoustic information extracted from the audio waveform, decoder outputs from an automatic speech recognition (ASR) system used as input features to a large language model (LLM), and a multimodal system that combines acoustic, lexical, and ASR decoder signals in an LLM. The multimodal approach achieves relative equal-error-rate improvements of up to 39% and 61% over text-only and audio-only models, respectively. |
| Low | GrooveSquid.com (original content) | The paper looks into making interactions with virtual assistants more natural by getting rid of the need for a specific trigger phrase. The authors try three ideas: using just the sound of the audio, using the top guesses from automatic speech recognition (ASR) systems as clues for big language models (LLMs), and building a system that combines all of these kinds of information (sound, words, and ASR hints) to help LLMs decide whether a command is meant for the device. Combining the different types of information makes a big difference, with results up to 39% and 61% better than using text or audio alone. |
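Both summaries quote improvements in equal error rate (EER), the standard detection metric where the false-acceptance rate equals the false-rejection rate. As background, here is a minimal sketch of how EER can be computed from detector scores; the function name and the toy scores below are illustrative, not taken from the paper:

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a decision threshold over the scores.

    scores: detector outputs (higher = more likely device-directed)
    labels: 1 for device-directed utterances, 0 for non-directed ones
    Returns the average of FAR and FRR at the threshold where they are closest.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # non-directed audio accepted
        frr = sum(s < t for s in pos) / len(pos)   # directed audio rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example: perfectly separated scores give an EER of 0.
print(equal_error_rate([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0]))
```

A "39% relative improvement" then means the multimodal model's EER is 39% lower than the text-only model's EER, i.e. `(eer_text - eer_multimodal) / eer_text = 0.39`.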
Keywords
* Artificial intelligence
* Decoder