Listen Then See: Video Alignment with Speaker Attention

by Aviral Agrawal, Carlos Mateo Samudio Lezcano, Iqui Balam Heredia-Marin, Prabhdeep Singh Sethi

First submitted to arXiv on: 21 Apr 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Video-based Question Answering (Video QA) is a challenging task that becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires not only context understanding, temporal reasoning, and the integration of multimodal information, but also the processing of nuanced human behavior. The paper introduces a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results on the Social IQ 2.0 dataset for SIQA. The approach leverages the video modality more effectively by using the audio modality as a bridge to the language modality, mitigating the prevalent issue of language overfitting, and the resulting bypassing of the video modality, encountered by current techniques (see the illustrative sketch after the summaries).

Low Difficulty Summary (written by GrooveSquid.com, original content)
Video-based Question Answering is a tough task that gets even harder when the questions are about people. Answering them requires understanding what's happening in a video, how events relate in time, and combining different kinds of information such as sight, sound, and language. The problem is made worse because models often rely on the text and ignore the other information. To solve this, the researchers developed a new way to bring the different kinds of information together, and it worked better than previous methods.

Keywords

  • Artificial intelligence
  • Alignment
  • Overfitting
  • Question answering