Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

by Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, Shijian Lu

First submitted to arXiv on: 18 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large vision-language models (LVLMs) may generate textual responses that are inconsistent with the objects actually present in an image, a phenomenon known as object hallucination. The issue is rooted in the model’s deficient attention to discriminative image features: the model attends primarily to prompt-irrelevant global features rather than prompt-relevant local features, which undermines its visual grounding capacity and leads to hallucinations. The proposed Assembly of Global and Local Attention (AGLA) approach addresses this by assembling global features for response generation and local features for visual discrimination simultaneously. AGLA is a training-free, plug-and-play method that uses an image-prompt matching scheme to capture prompt-relevant local features from images, allowing the model to suppress irrelevant distractions and highlight relevant content. The calibrated logit distribution generated by AGLA combines generative global features of the original image with discriminative local features of the augmented image, effectively mitigating hallucinations (a rough sketch of this calibration step follows the summaries below). Experiments demonstrate that AGLA outperforms existing approaches at mitigating LVLM hallucinations, across both discriminative and generative tasks.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large vision-language models (LVLMs) are very smart computers that can understand images and text. Sometimes they make mistakes by describing things in an image that aren’t really there. These mistakes are called “hallucinations.” The problem happens because the model looks at the wrong parts of the picture instead of focusing on what’s important. A new approach called Assembly of Global and Local Attention (AGLA) helps fix this issue. It works by looking at both the big picture and the small details, so the model can describe things accurately. This is a game-changer for computers that understand images and text!
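
The calibration step described in the medium difficulty summary can be illustrated with a short sketch: one decoding pass over the original image produces “global” next-token logits, a second pass over the prompt-matched augmented image produces “local” logits, and the two are combined before sampling. The function names, the weighting factor alpha, and the plain weighted sum below are illustrative assumptions for a minimal sketch, not the paper’s exact formulation.

import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_distribution(logits_global, logits_local, alpha=1.0):
    """Assemble next-token logits from the original image (global, generative)
    and from a prompt-matched augmented image (local, discriminative), then
    normalize. The weighted sum and alpha are hypothetical stand-ins for the
    paper's calibration."""
    combined = logits_global + alpha * logits_local
    return softmax(combined)

# Toy usage with a 5-token vocabulary; in practice both logit vectors would
# come from two forward passes of the same LVLM.
rng = np.random.default_rng(0)
logits_global = rng.normal(size=5)  # pass over the original image
logits_local = rng.normal(size=5)   # pass over the augmented image
print(calibrated_distribution(logits_global, logits_local, alpha=0.5))

In the method as summarized, the augmented image comes from the image-prompt matching scheme, which highlights prompt-relevant regions and suppresses irrelevant ones; the sketch above only covers the final logit assembly.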

Keywords

» Artificial intelligence  » Attention  » Grounding  » Hallucination  » Language model  » Prompt