Summary of Multi-modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers, By Minoo Shayaninasab et al.
Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers
by Minoo Shayaninasab, Bagher Babaali
First submitted to arXiv on 11 Feb 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary: the paper's original abstract (available on its arXiv listing) |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper presents an approach to multimodal emotion recognition that leverages three input modalities: text, audio (speech), and video. The authors fine-tune pre-trained Transformer models to generate a feature vector for each modality, then fuse the vectors using several techniques. After experimenting with different fusion methods and classifiers, the best-performing configuration combines feature-level fusion (concatenating the modality feature vectors) with a Support Vector Machine classifier, achieving 75.42% accuracy on the IEMOCAP dataset. The results demonstrate the effectiveness of Transformer-based models for multimodal emotion recognition. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper shows how to recognize emotions in people using three different ways: text, speech, and video. The researchers use special computer models called Transformers to help with this task. They combine information from all three sources to get a better understanding of what someone is feeling. By trying out different ways to join the information together, they found that one method worked best – combining the information at the feature level and then using a special kind of machine learning called Support Vector Machines. This approach was able to correctly identify emotions 75.42% of the time. |
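The best-performing pipeline described above (feature-level fusion by concatenation, followed by an SVM) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature dimensions, the number of emotion classes, and the synthetic random features standing in for the fine-tuned Transformer embeddings are all assumptions for demonstration purposes.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples = 40

# Stand-ins for per-modality embeddings from fine-tuned Transformers.
# The dimensions (768 / 512 / 256) are illustrative assumptions.
text_feats = rng.normal(size=(n_samples, 768))
audio_feats = rng.normal(size=(n_samples, 512))
video_feats = rng.normal(size=(n_samples, 256))
labels = rng.integers(0, 4, size=n_samples)  # e.g. 4 emotion classes

# Feature-level fusion: concatenate the three vectors per sample.
fused = np.concatenate([text_feats, audio_feats, video_feats], axis=1)

# Classify the fused vectors with a Support Vector Machine.
clf = SVC(kernel="linear")
clf.fit(fused, labels)
preds = clf.predict(fused)
```

In a real setting, the random arrays would be replaced by embeddings extracted from the text, speech, and video encoders, and accuracy would be measured on a held-out split rather than the training data.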
Keywords
» Artificial intelligence » Classification » Fine tuning » Machine learning » Support vector machine » Transformer