Fusion approaches for emotion recognition from speech using acoustic and text-based features
by Leonardo Pepino, Pablo Riera, Luciana Ferrer, Agustin Gravano
First submitted to arXiv on: 27 Mar 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This machine learning study proposes using contextualized word embeddings from BERT to improve emotion classification from speech. The approach combines acoustic and text-based features and outperforms traditional word representations such as GloVe embeddings. The paper evaluates several fusion strategies on the IEMOCAP and MSP-PODCAST datasets, finding that combining modalities is beneficial, though the differences between fusion approaches are subtle. The study also highlights the importance of how cross-validation folds are defined in emotion classification tasks, cautioning against overestimating model performance. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This research explores ways to recognize emotions in spoken language. Scientists developed a new way to represent words in speech transcriptions using BERT, which improved results. They also tested different methods for combining audio and text information and found that mixing the two is helpful. The study used datasets from IEMOCAP and MSP-PODCAST to test the approach. |
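The fusion strategies the summaries refer to can be illustrated with a minimal sketch. The snippet below is not the paper's actual method: the feature dimensions, class set, and scores are hypothetical stand-ins, used only to show the difference between late (decision-level) fusion, which averages per-modality class probabilities, and early (feature-level) fusion, which concatenates feature vectors before a single classifier.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical per-modality emotion scores for one utterance
# (4 classes, e.g. angry/happy/sad/neutral; values are made up).
acoustic_logits = [1.2, 0.3, -0.5, 0.1]
text_logits = [0.4, 1.0, -0.2, 0.0]

# Late (decision-level) fusion: average the class probabilities
# produced by two unimodal classifiers.
p_acoustic = softmax(acoustic_logits)
p_text = softmax(text_logits)
fused_probs = [(a + t) / 2 for a, t in zip(p_acoustic, p_text)]
pred = fused_probs.index(max(fused_probs))

# Early (feature-level) fusion instead concatenates the raw feature
# vectors before training a single classifier on the joint vector.
acoustic_feats = [0.1] * 128   # e.g. pooled acoustic features (hypothetical size)
text_feats = [0.2] * 768       # e.g. a BERT sentence embedding (hypothetical size)
early_fused = acoustic_feats + text_feats  # length 128 + 768 = 896
```

In practice the fused representation (`fused_probs` or `early_fused`) would feed a learned classifier; the paper's contribution is comparing such strategies, not this toy arithmetic.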
Keywords
* Artificial intelligence * BERT * Classification * GloVe * Machine learning