Summary of Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models, by Jean Park et al.
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
by Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, Kevin Johnson
First submitted to arXiv on: 22 Aug 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper proposes a method for identifying unimodal bias in video question-answering (VidQA) benchmarks and datasets, many of which favor a single modality over the others. The authors introduce the modality importance score (MIS), a measure of which modality carries the information needed to answer a question, and propose estimating MIS with state-of-the-art multimodal large language models (MLLMs) as a proxy for human judgments of modality importance. The results demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. Ablation studies on permuted feature sets indicate that current models do not effectively integrate information, owing to modality imbalance in the datasets. The proposed MIS can guide the curation of modality-balanced datasets, advancing multimodal learning and enhancing MLLMs' capabilities. (A minimal sketch of the estimation idea appears after this table.) |
Low | GrooveSquid.com (original content) | This paper is about making video question-answering (VidQA) tests fair by measuring which kind of information each question really needs. Right now, many VidQA tests lean on just one kind of information, like text or images, instead of requiring all kinds together. The researchers created a new tool called the modality importance score (MIS) to measure this, and they use special computer models called multimodal large language models (MLLMs) to estimate how important each kind of information is. The results show that many VidQA tests are biased and rarely ask for all kinds of information at once. With the MIS, we can build new tests that truly need every kind of information, which will help make computer models better at understanding different kinds of data. |
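
To make the MIS idea concrete, below is a minimal sketch of how one might probe per-modality importance with an MLLM. Everything here is illustrative: the `ask` callable, the 0/1 scoring rule, and the `is_genuinely_multimodal` check are assumptions made for demonstration, not the paper's exact formulation.

```python
from typing import Callable, Dict

# `ask` stands in for any MLLM client: it takes a question plus whichever
# modality contexts are supplied and returns the model's answer string.
AskFn = Callable[..., str]

def modality_importance(
    ask: AskFn, question: str, gold_answer: str, video, subtitles
) -> Dict[str, float]:
    """Illustrative per-modality score: 1.0 if the model answers correctly
    from that input configuration, else 0.0 (a proxy, not the paper's rule)."""
    return {
        "video_only": float(ask(question, video=video) == gold_answer),
        "text_only": float(ask(question, subtitles=subtitles) == gold_answer),
        "both": float(ask(question, video=video, subtitles=subtitles) == gold_answer),
    }

def is_genuinely_multimodal(scores: Dict[str, float]) -> bool:
    # Under this sketch, a question is genuinely multimodal when neither
    # modality suffices alone but the combination does.
    return (
        scores["both"] == 1.0
        and scores["video_only"] == 0.0
        and scores["text_only"] == 0.0
    )

# Toy stand-in for an MLLM, for demonstration only: it "answers" correctly
# only when both modalities are present.
def toy_ask(question: str, video=None, subtitles=None) -> str:
    return "yes" if video is not None and subtitles is not None else "no"

scores = modality_importance(toy_ask, "Does she wave?", "yes", "clip.mp4", "clip.srt")
print(scores)                           # {'video_only': 0.0, 'text_only': 0.0, 'both': 1.0}
print(is_genuinely_multimodal(scores))  # True
```

A real pipeline would replace `toy_ask` with calls to an actual MLLM and would likely aggregate over multiple answer options or samples rather than relying on this single 0/1 check.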
Keywords
* Artificial intelligence
* Question answering