Summary of Large Vision-Language Models as Emotion Recognizers in Context Awareness, by Yuxuan Lei et al.
Large Vision-Language Models as Emotion Recognizers in Context Awareness
by Yuxuan Lei, Dingkang Yang, Zhaoyu Chen, Jiawei Chen, Peng Zhai, Lihua Zhang
First submitted to arXiv on: 16 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Context-aware emotion recognition (CAER) is a complex task that requires perceiving emotions from diverse contextual cues. Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images, but their knowledge is confined to specific training datasets and may reflect the subjective emotional biases of the annotators. In this paper, we explore the potential of Large Vision-Language Models (LVLMs) to empower the CAER task through three paradigms: fine-tuning LVLMs on two CAER datasets, designing zero-shot and few-shot patterns, and incorporating Chain-of-Thought (CoT) into our framework. We develop an image similarity-based ranking algorithm to retrieve examples; the instruction, the retrieved examples, and the test example are then combined and fed to the LVLM for sentiment judgment (see the sketch after the table). Extensive experiments demonstrate that LVLMs achieve competitive performance on CAER tasks across these paradigms. |
| Low | GrooveSquid.com (original content) | This paper is about recognizing emotions from pictures and words. The authors want to make machines better at understanding how people feel when they see something, like a happy or sad face. The problem is that teaching machines to recognize emotions is tricky, because the labels depend on the people who annotated the data. In this research, the authors try three different ways to make Large Vision-Language Models (LVLMs) better at recognizing emotions: they fine-tune LVLMs on training data, design ways to use LVLMs with little or no extra training, and add a technique called Chain-of-Thought to help the models reason about emotions. The results show that LVLMs can do well at recognizing emotions from pictures and words. |
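The few-shot paradigm described in the medium summary hinges on retrieving visually similar training examples and combining them with an instruction and the test example. Below is a minimal, hypothetical sketch of that idea. The paper does not specify its embedding model or prompt template, so the use of CLIP (via Hugging Face `transformers`), the function names `retrieve_examples` and `build_prompt`, and the prompt layout are all illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only: the paper's exact retrieval and prompting
# details are not given, so CLIP and this prompt layout are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Encode images into L2-normalized CLIP embeddings."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve_examples(test_path, train_paths, train_labels, k=4):
    """Rank training images by cosine similarity to the test image and
    return the top-k (path, emotion label) pairs as few-shot examples."""
    sims = (embed_images(train_paths) @ embed_images([test_path]).T).squeeze(1)
    top = sims.topk(k).indices.tolist()
    return [(train_paths[i], train_labels[i]) for i in top]

def build_prompt(instruction, examples, test_path):
    """Combine the instruction, retrieved examples, and the test example
    into one prompt, mirroring the few-shot pattern in the summary."""
    parts = [instruction]
    parts += [f"Image: {p}\nEmotion: {label}" for p, label in examples]
    parts.append(f"Image: {test_path}\nEmotion:")
    return "\n\n".join(parts)
```

In the Chain-of-Thought variant the paper mentions, the instruction would additionally ask the model to reason step by step about contextual cues before naming an emotion.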
Keywords
- Artificial intelligence
- Few-shot
- Fine-tuning
- Zero-shot