Summary of A Multimodal Approach to Device-directed Speech Detection with Large Language Models, by Dominik Wagner et al.
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
by Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
First submitted to arXiv on: 21 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper investigates ways to make interactions with virtual assistants more intuitive by removing the need for a trigger phrase. The researchers explore three approaches: acoustic information extracted from the audio waveform, decoder outputs from an automatic speech recognition (ASR) system used as input features to a large language model (LLM), and a multimodal system that combines acoustic, lexical, and ASR decoder signals in an LLM. The multimodal approach achieves relative equal-error-rate improvements of up to 39% and 61% over text-only and audio-only models, respectively. |
| Low | GrooveSquid.com (original content) | The paper looks into making interactions with virtual assistants more natural by getting rid of the need for a specific trigger phrase. The authors try three ideas: using just the sound of the audio, using the top guesses from automatic speech recognition (ASR) systems as clues for big language models (LLMs), and building a system that combines all of these kinds of information (sound, words, and ASR hints) to help LLMs decide whether a command is meant for the device. Combining the different types of information makes a big difference, with results up to 39% and 61% better than using text or audio alone. |
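Both summaries quote improvements in equal error rate (EER), the standard detection metric where the false-acceptance rate equals the false-rejection rate. As background, here is a minimal sketch of how EER can be computed from detector scores; the function name and the toy scores below are illustrative, not taken from the paper:

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a decision threshold over the scores.

    scores: detector outputs (higher = more likely device-directed)
    labels: 1 for device-directed utterances, 0 for non-directed ones
    Returns the average of FAR and FRR at the threshold where they are closest.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # non-directed audio accepted
        frr = sum(s < t for s in pos) / len(pos)   # directed audio rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example: perfectly separated scores give an EER of 0.
print(equal_error_rate([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0]))
```

A "39% relative improvement" then means the multimodal model's EER is 39% lower than the text-only model's EER, i.e. `(eer_text - eer_multimodal) / eer_text = 0.39`.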
Keywords
* Artificial intelligence
* Decoder