
Summary of Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy, by Te Yang et al.


Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

by Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei

First submitted to arXiv on: 23 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available from the arXiv entry.

Medium Difficulty Summary (GrooveSquid.com, original content)
The paper proposes strategies to enhance the instruction-following capabilities of Multimodal Large Language Models (MLLMs) without compromising their multimodal understanding capacity. The authors demonstrate that spatially down-sampling visual tokens improves MLLMs’ instruction-following abilities, but at the cost of impaired multimodal understanding. To bridge the instruction-following gap between MLLMs and Large Language Models (LLMs), the researchers introduce two strategies: Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI). VMTC condenses redundant visual tokens through token clustering and merging, while CMAI inhibits attention on text-image token pairs with low focus scores (see the illustrative sketch after the summaries). The proposed methods are evaluated on five benchmarks: VQA-V2, GQA, TextVQA, MME, and MMBench. Results show that the strategies significantly enhance MLLMs’ instruction-following capabilities while preserving their multimodal understanding abilities.

Low Difficulty Summary (GrooveSquid.com, original content)
The paper tries to make Multimodal Large Language Models (MLLMs) better at following instructions without losing their ability to understand different types of information. Currently, MLLMs are not as good as plain language models at following instructions. To fix this, the researchers came up with two new ideas: Visual-Modality Token Compression and Cross-Modality Attention Inhibition. These methods help MLLMs follow instructions better by getting rid of extra visual information that isn’t important. The authors tested these strategies on several tasks and showed that they work well.

Keywords

  • Artificial intelligence
  • Attention
  • Clustering
  • Token