Summary of Towards the Next Frontier in Speech Representation Learning Using Disentanglement, by Varun Krishna and Sriram Ganapathy
Towards the Next Frontier in Speech Representation Learning Using Disentanglement
by Varun Krishna, Sriram Ganapathy
First submitted to arXiv on: 2 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The paper proposes a novel framework for learning self-supervised speech representations. It consists of two encoder modules: a frame-level module and an utterance-level module. The frame-level module is inspired by existing self-supervision techniques and learns pseudo-phonemic representations, while the utterance-level module uses contrastive learning to learn pseudo-speaker representations. The two encoders are trained jointly using a mutual information-based criterion, with the goal of disentangling their representations. The proposed framework, termed Learn2Diss, is evaluated on several downstream tasks and achieves state-of-the-art results. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary The paper introduces a new way to learn speech representations that focuses on both short-term and long-term patterns in speech. Instead of just looking at individual sounds or frames, the approach considers the speaker’s characteristics and other consistent features throughout an entire sentence. This helps improve performance on tasks like recognizing what someone is saying, as well as understanding their tone and emotions. |
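
To make the contrastive-learning idea in the medium summary more concrete, here is a minimal sketch of an InfoNCE-style loss, a common contrastive objective. This is an illustration only: the summary does not say which contrastive loss Learn2Diss uses, and the function names (`cosine`, `info_nce`), the temperature value, and the toy two-dimensional "utterance embeddings" are all assumptions made for the example. The mutual information-based disentanglement criterion is not sketched here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor (hypothetical helper).

    The loss is the negative log-softmax score of the positive pair
    against the positive plus all negatives, so it is small when the
    anchor is closer to the positive than to any negative.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (assumed): the anchor and its positive stand in for two
# utterances from the same pseudo-speaker; the negatives stand in for others.
anchor = [1.0, 0.0]
same_speaker = [0.9, 0.1]
other_speakers = [[0.0, 1.0], [-1.0, 0.0]]

good_loss = info_nce(anchor, same_speaker, other_speakers)
bad_loss = info_nce(anchor, [0.0, 1.0], [same_speaker, [-1.0, 0.0]])
print(good_loss < bad_loss)
```

Minimizing a loss of this kind pulls embeddings of matching utterances together and pushes mismatched ones apart, which is the general mechanism behind learning speaker-like representations without labels.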
Keywords
- Artificial intelligence
- Encoder
- Self supervised