
Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers

by Minoo Shayaninasab, Bagher Babaali

First submitted to arXiv on: 11 Feb 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

The high difficulty version is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)

This paper presents a novel approach to multimodal emotion recognition that leverages three input modalities: text, audio (speech), and video. The authors fine-tune pre-trained Transformer models to generate a feature vector for each modality, then fuse the vectors using several techniques. After experimenting with different fusion methods and classifiers, the best-performing model combines feature-level fusion (concatenating the feature vectors) with a Support Vector Machine classifier, achieving 75.42% accuracy on the IEMOCAP dataset. This result demonstrates the effectiveness of Transformer-based models for multimodal emotion recognition.

Low Difficulty Summary (written by GrooveSquid.com, original content)

This paper shows how to recognize people's emotions in three different ways: from text, speech, and video. The researchers use special computer models called Transformers to help with this task. They combine information from all three sources to get a better picture of what someone is feeling. After trying several ways of joining the information, they found that one method worked best: combining the information at the feature level and then using a kind of machine learning called a Support Vector Machine. This approach correctly identified emotions 75.42% of the time.
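The fusion-and-classification step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual pipeline: the feature dimensions, the random stand-in features, and the SVM hyperparameters are all placeholder assumptions; in the paper, the vectors would come from fine-tuned Transformer encoders over IEMOCAP utterances.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 200  # illustrative sample count, not the real dataset size

# Stand-ins for per-utterance embeddings from fine-tuned Transformers.
# The dimensions below are arbitrary placeholders.
text_feats = rng.normal(size=(n, 768))
audio_feats = rng.normal(size=(n, 512))
video_feats = rng.normal(size=(n, 256))
labels = rng.integers(0, 4, size=n)  # e.g. four emotion classes

# Feature-level fusion: concatenate the three modality vectors per sample
fused = np.concatenate([text_feats, audio_feats, video_feats], axis=1)

# Classify the fused vectors with a Support Vector Machine
X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))
```

With random features the accuracy is near chance; the point is only the shape of the approach: each modality contributes one vector, concatenation merges them, and a single classifier operates on the joint representation.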

Keywords

» Artificial intelligence  » Classification  » Fine tuning  » Machine learning  » Support vector machine  » Transformer