Summary of "Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy" by Te Yang et al.
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
by Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei
First submitted to arXiv on: 23 Nov 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper's original abstract; read it on the arXiv listing. |
| Medium | GrooveSquid.com (original content) | The paper proposes strategies to enhance the instruction-following capability of Multimodal Large Language Models (MLLMs) without compromising their multimodal understanding. The authors show that spatially down-sampling visual tokens improves MLLMs' instruction-following ability, but at the cost of impaired multimodal understanding. To close the instruction-following gap between MLLMs and text-only Large Language Models (LLMs), the researchers introduce two strategies: Visual-Modality Token Compression (VMTC), which condenses redundant visual tokens through token clustering and merging, and Cross-Modality Attention Inhibition (CMAI), which suppresses attention between text-image token pairs with low focus scores (an illustrative sketch of both ideas follows this table). The methods are evaluated on five benchmarks: VQA-V2, GQA, TextVQA, MME, and MMBench. Results show that the strategies significantly enhance MLLMs' instruction-following capability while preserving their multimodal understanding abilities. |
| Low | GrooveSquid.com (original content) | The paper aims to make Multimodal Large Language Models (MLLMs) better at following instructions without losing their ability to understand different types of information. Currently, MLLMs are not as good as text-only language models at following instructions. To fix this, the researchers propose two new ideas: Visual-Modality Token Compression and Cross-Modality Attention Inhibition. These methods help MLLMs follow instructions better by removing visual information that is redundant or unimportant. The authors tested the strategies on several benchmarks and showed that they work well. |
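For readers who want a concrete picture of the two strategies described in the medium-difficulty summary, below is a minimal, self-contained sketch. It is not the authors' implementation: the pairwise-similarity merging rule, the softmax-based "focus score", the threshold value, the tensor shapes, and function names such as `merge_redundant_visual_tokens` are assumptions made purely for illustration.

```python
# Illustrative sketch only, NOT the paper's released code. The merging rule,
# the "focus score" definition, and all shapes/thresholds are assumptions.
import torch
import torch.nn.functional as F


def merge_redundant_visual_tokens(visual_tokens: torch.Tensor,
                                  keep_ratio: float = 0.5) -> torch.Tensor:
    """Toy stand-in for Visual-Modality Token Compression (VMTC).

    visual_tokens: (num_tokens, dim) image-patch embeddings.
    The most similar pair of tokens is repeatedly merged (averaged)
    until only keep_ratio of the tokens remain.
    """
    n = visual_tokens.shape[0]
    target = max(1, int(n * keep_ratio))
    tokens = visual_tokens.clone()
    while tokens.shape[0] > target:
        # Cosine similarity between all token pairs; mask out self-similarity.
        sim = F.cosine_similarity(tokens.unsqueeze(1), tokens.unsqueeze(0), dim=-1)
        sim.fill_diagonal_(-float("inf"))
        # Merge the single most redundant pair into its mean.
        idx = torch.argmax(sim)
        i, j = divmod(idx.item(), tokens.shape[0])
        merged = (tokens[i] + tokens[j]) / 2
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged.unsqueeze(0)], dim=0)
    return tokens


def cross_modality_attention_bias(text_tokens: torch.Tensor,
                                  visual_tokens: torch.Tensor,
                                  threshold: float = 0.1) -> torch.Tensor:
    """Toy stand-in for Cross-Modality Attention Inhibition (CMAI).

    Returns an additive attention bias of shape (num_text, num_visual).
    Text-image pairs whose focus score (here, a softmax-normalized dot
    product) falls below the threshold receive -inf, suppressing attention
    from that text token to that image token.
    """
    scores = text_tokens @ visual_tokens.T      # raw text-to-image affinities
    focus = scores.softmax(dim=-1)              # per-text-token focus scores
    bias = torch.zeros_like(scores)
    bias[focus < threshold] = float("-inf")
    return bias


if __name__ == "__main__":
    vis = torch.randn(16, 32)   # 16 visual tokens, dim 32 (made-up sizes)
    txt = torch.randn(4, 32)    # 4 text tokens
    compressed = merge_redundant_visual_tokens(vis, keep_ratio=0.5)
    bias = cross_modality_attention_bias(txt, compressed)
    print(compressed.shape, bias.shape)   # torch.Size([8, 32]) torch.Size([4, 8])
```

In an actual MLLM the compressed visual tokens would stand in for the full set of patch embeddings fed to the language model, and the bias would typically be added to the text-to-image attention logits before the softmax; the sketch only demonstrates the shape of both operations.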
Keywords
- Artificial intelligence
- Attention
- Clustering
- Token