Summary of Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition, by Yash Jain et al.


Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

by Yash Jain, David Chan, Pranav Dheram, Aparna Khare, Olabanji Shonibare, Venkatesh Ravichandran, Shalini Ghosh

First submitted to arXiv on: 28 Mar 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary
Written by: Paper authors
Read the original abstract here

Medium Difficulty Summary
Written by: GrooveSquid.com (original content)
This paper introduces a method that combines multi-modal, multi-task unsupervised pre-training with a translation-based supervised mid-training stage to improve automatic speech recognition (ASR) performance. After fine-tuning on uni-modal tasks, the method achieves relative word error rate (WER) reductions of up to 38.45% over baselines on Librispeech and SUPERB.

Low Difficulty Summary
Written by: GrooveSquid.com (original content)
The authors use multi-stage pre-training: unsupervised pre-training comes first, followed by mid-training with a translation-based supervised method. This improves ASR performance over existing methods, which use only single-stage pre-training with a single unsupervised task. The paper also offers guidance on choosing pre-training methods and datasets.
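
The medium difficulty summary reports its result as a relative WER reduction, which is easy to confuse with an absolute drop in percentage points. The sketch below illustrates the arithmetic, assuming the common definition of WER as word-level edit distance divided by the number of reference words. The function names and the example sentences are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Percentage by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative transcripts only, not data from the paper.
reference = "the cat sat on the mat"
baseline = word_error_rate(reference, "a cat sit on mat")    # 3 errors over 6 words
improved = word_error_rate(reference, "the cat sat on mat")  # 1 error over 6 words
print(f"baseline WER: {baseline:.3f}, improved WER: {improved:.3f}")
print(f"relative reduction: {relative_wer_reduction(baseline, improved):.2f}%")
```

Under this definition, a relative reduction of 38.45% means the new WER is 61.55% of the baseline WER; it does not mean the WER dropped by 38.45 percentage points.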

Keywords

» Artificial intelligence  » Fine tuning  » Multi modal  » Multi task  » Supervised  » Translation  » Unsupervised