
Summary of Towards the Next Frontier in Speech Representation Learning Using Disentanglement, by Varun Krishna and Sriram Ganapathy


Towards the Next Frontier in Speech Representation Learning Using Disentanglement

by Varun Krishna, Sriram Ganapathy

First submitted to arXiv on: 2 Jul 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper proposes a framework for self-supervised speech representation learning built from two encoder modules: a frame-level module and an utterance-level module. The frame-level module is inspired by existing self-supervision techniques and learns pseudo-phonemic representations, while the utterance-level module uses contrastive learning to learn pseudo-speaker representations. The two encoders are trained jointly under a mutual information-based criterion whose goal is to disentangle their representations. The proposed framework, termed Learn2Diss, is evaluated on several downstream tasks and achieves state-of-the-art results.

Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper introduces a new way to learn speech representations that focuses on both short-term and long-term patterns in speech. Instead of looking only at individual sounds or frames, the approach also considers the speaker's characteristics and other features that stay consistent throughout an entire sentence. This helps improve performance on tasks like recognizing what someone is saying, as well as understanding their tone and emotions.
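
To make the "contrastive learning" step in the medium summary more concrete, here is a minimal sketch of an InfoNCE-style contrastive loss over utterance-level embeddings. This is not the authors' implementation: the summaries do not say which contrastive objective Learn2Diss uses, so the choice of InfoNCE, the function names `cosine` and `info_nce`, the temperature value, and the toy three-dimensional embeddings are all assumptions made for illustration. It also assumes that each positive pair comes from two segments of the same utterance.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (hypothetical helper, not from the paper).

    positives[i] is the match for anchors[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching candidate as the target.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy utterance-level embeddings: two segments from each of three utterances.
seg_a = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
seg_b = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = info_nce(seg_a, seg_b)                     # correct pairing
shuffled = info_nce(seg_a, seg_b[1:] + seg_b[:1])    # wrong pairing
print(f"aligned loss:  {aligned:.4f}")
print(f"shuffled loss: {shuffled:.4f}")
```

With the correct pairing the loss is much lower than with the shuffled pairing, which is the behavior a contrastive objective is meant to reward: embeddings of segments from the same utterance are pulled together and those from different utterances are pushed apart. The mutual information-based disentanglement criterion mentioned in the summary is a separate term and is not sketched here.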

Keywords

* Artificial intelligence
* Encoder
* Self-supervised