Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

by Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, Shijian Lu

First submitted to arXiv on: 18 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large vision-language models (LVLMs) may generate textual responses that are inconsistent with the objects actually present in an image, a phenomenon known as object hallucination. The issue is rooted in the model’s deficient attention to discriminative image features: the model attends primarily to prompt-irrelevant global features rather than prompt-relevant local features, which undermines its visual grounding capacity and leads to hallucinations. The proposed Assembly of Global and Local Attention (AGLA) approach addresses this by assembling global features for response generation and local features for visual discrimination simultaneously. AGLA is a training-free, plug-and-play method that uses an image-prompt matching scheme to capture prompt-relevant local features from images, allowing the model to suppress irrelevant distractions and highlight relevant content. The calibrated logit distribution generated by AGLA combines generative global features of the original image with discriminative local features of the augmented image, effectively mitigating hallucinations (a rough sketch of this calibration step follows the summaries below). Experiments demonstrate that AGLA outperforms existing approaches at mitigating LVLM hallucinations, across both discriminative and generative tasks.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large vision-language models (LVLMs) are very smart computers that can understand images and text. Sometimes they make mistakes by describing things in an image that aren’t really there. These mistakes are called “hallucinations.” The problem happens because the model looks at the wrong parts of the picture instead of focusing on what’s important. A new approach called Assembly of Global and Local Attention (AGLA) helps fix this issue. It works by looking at both the big picture and the small details, so the model can describe things accurately. This is a game-changer for computers that understand images and text!
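
The calibration step described in the medium difficulty summary can be illustrated with a short sketch: one decoding pass over the original image produces “global” next-token logits, a second pass over the prompt-matched augmented image produces “local” logits, and the two are combined before sampling. The function names, the weighting factor alpha, and the plain weighted sum below are illustrative assumptions for a minimal sketch, not the paper’s exact formulation.

import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_distribution(logits_global, logits_local, alpha=1.0):
    """Assemble next-token logits from the original image (global, generative)
    and from a prompt-matched augmented image (local, discriminative), then
    normalize. The weighted sum and alpha are hypothetical stand-ins for the
    paper's calibration."""
    combined = logits_global + alpha * logits_local
    return softmax(combined)

# Toy usage with a 5-token vocabulary; in practice both logit vectors would
# come from two forward passes of the same LVLM.
rng = np.random.default_rng(0)
logits_global = rng.normal(size=5)  # pass over the original image
logits_local = rng.normal(size=5)   # pass over the augmented image
print(calibrated_distribution(logits_global, logits_local, alpha=0.5))

In the method as summarized, the augmented image comes from the image-prompt matching scheme, which highlights prompt-relevant regions and suppresses irrelevant ones; the sketch above only covers the final logit assembly.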

Keywords

» Artificial intelligence  » Attention  » Grounding  » Hallucination  » Language model  » Prompt