
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

by Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, Kevin Johnson

First submitted to arXiv on: 22 Aug 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper proposes a method for identifying unimodal bias in video question-answering (VidQA) benchmarks and datasets, which often favor a single modality over others. The authors introduce the modality importance score (MIS), a measure of which modality carries the information needed to answer a question. They also propose using state-of-the-art multimodal large language models (MLLMs) to estimate the MIS, serving as a proxy for human judgments of modality importance. The results demonstrate both the presence of unimodal bias and a scarcity of genuinely multimodal questions in existing datasets. Ablation studies evaluating MLLMs on permuted feature sets indicate that current models fail to integrate information effectively when modalities are imbalanced. The proposed MIS can guide the curation of modality-balanced datasets, advancing multimodal learning and enhancing MLLMs' capabilities.
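The summary above does not give the MIS formula, so the following is only an illustrative sketch of the general idea, not the paper's definition: if a question can be answered correctly from one modality alone, that modality is assigned higher importance. All names here (`score_modality_importance`, `unimodal_correct`) are hypothetical.

```python
# Illustrative toy version of a modality importance score: for one question,
# score each modality by whether the question is answerable from that modality
# alone, then normalize so the scores sum to 1 whenever any modality suffices.
# This is NOT the paper's MIS formula, just a sketch of the concept.

def score_modality_importance(unimodal_correct: dict[str, bool]) -> dict[str, float]:
    """unimodal_correct maps modality name -> whether a model answered the
    question correctly using only that modality. Returns normalized scores."""
    raw = {m: 1.0 if ok else 0.0 for m, ok in unimodal_correct.items()}
    total = sum(raw.values())
    if total == 0:
        # No single modality suffices: a genuinely multimodal question.
        return {m: 0.0 for m in raw}
    return {m: v / total for m, v in raw.items()}

# Example: a question answerable from subtitles alone is text-biased.
print(score_modality_importance({"video": False, "text": True}))
```

Under this toy scoring, a benchmark dominated by questions where one modality gets all the weight would exhibit exactly the unimodal bias the paper describes, while questions scoring zero everywhere are candidates for the "genuinely multimodal" category.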
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper is about a way to make sure that video question-answering (VidQA) tests are fair by measuring which parts of the test are most important. Right now, many VidQA tests are biased towards just one type of information, like text or images, instead of using all types together. The researchers created a new tool called the modality importance score (MIS) to help fix this problem. They also came up with a way to use special computer models called multimodal large language models (MLLMs) to estimate how important each type of information is. The results show that many VidQA tests are biased and don’t really ask for all types of information together. By using the MIS, we can create new tests that require all types of information to be used together, which will help make computer models better at understanding different types of data.
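The permuted-feature ablation mentioned above can be sketched in a few lines, assuming nothing about the paper's actual code: shuffle one modality's inputs across questions so they no longer match, then check how much accuracy drops. The `answer` function and the example fields are hypothetical stand-ins for any VidQA model and dataset.

```python
import random

# Hypothetical sketch of a permutation ablation. `examples` is a list of
# dicts with "video", "text", and "label" keys; `answer(example)` returns a
# predicted label. If accuracy barely drops after permuting a modality, the
# model (or benchmark) was not really using that modality.

def permutation_ablation(examples, answer, modality="video", seed=0):
    """Return (baseline_acc, permuted_acc) for the given modality."""
    n = len(examples)
    baseline = sum(answer(ex) == ex["label"] for ex in examples) / n

    # Shuffle one modality across the dataset so it no longer matches.
    shuffled = [ex[modality] for ex in examples]
    random.Random(seed).shuffle(shuffled)
    permuted_examples = [{**ex, modality: s}
                         for ex, s in zip(examples, shuffled)]
    permuted = sum(answer(ex) == ex["label"] for ex in permuted_examples) / n
    return baseline, permuted
```

For instance, a model that answers from text alone keeps its full accuracy when the video inputs are permuted, which is one way a text-biased benchmark or model reveals itself.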

Keywords

* Artificial intelligence  * Question answering