
Summary of The Instinctive Bias: Spurious Images lead to Illusion in MLLMs, by Tianyang Han et al.


The Instinctive Bias: Spurious Images lead to Illusion in MLLMs

by Tianyang Han, Qing Lian, Rui Pan, Renjie Pi, Jipeng Zhang, Shizhe Diao, Yong Lin, Tong Zhang

First submitted to arXiv on: 6 Feb 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from whichever version suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Recent advances in large language models (LLMs) have led to impressive performance on a variety of multi-modal tasks, thanks to the advent of multi-modal LLMs (MLLMs). However, even powerful MLLMs such as GPT-4V still struggle with certain image and text inputs. This paper identifies a class of inputs that baffles MLLMs: images that are highly relevant to a question but inconsistent with its answer, which cause MLLMs to suffer from visual illusion. To quantify this effect, the authors propose CorrelationQA, a benchmark that assesses the level of visual illusion induced by such spurious images. CorrelationQA contains 7,308 text-image pairs across 13 categories and is used to evaluate 9 mainstream MLLMs, showing that they universally suffer from this instinctive bias to varying degrees. The study aims to support better assessments of MLLMs’ robustness in the presence of misleading images. (A rough sketch of this kind of evaluation follows the summaries below.)

Low Difficulty Summary (original content by GrooveSquid.com)
This paper talks about how big language models can get confused when shown certain pictures and words. These models are great at tasks that involve text and images, but they still make mistakes. The authors found a type of picture-and-word combination that makes these models really bad at guessing the correct answer. To understand this problem better, they created a special test called CorrelationQA to measure how well different models handle these tricky combinations. They tested 9 popular models and found that all of them make mistakes in some way.

Keywords

* Artificial intelligence  * GPT  * Multi-modal