
Summary of Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy, by Te Yang et al.


Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

by Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei

First submitted to arXiv on: 23 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available from the arXiv entry.

Medium Difficulty Summary (GrooveSquid.com, original content)
The paper proposes strategies to enhance the instruction-following capabilities of Multimodal Large Language Models (MLLMs) without compromising their multimodal understanding capacity. The authors demonstrate that spatially down-sampling visual tokens improves MLLMs’ instruction-following abilities, but at the cost of impaired multimodal understanding. To bridge the instruction-following gap between MLLMs and Large Language Models (LLMs), the researchers introduce two strategies: Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI). VMTC condenses redundant visual tokens through token clustering and merging, while CMAI inhibits attention on text-image token pairs with low focus scores (see the illustrative sketch after the summaries). The proposed methods are evaluated on five benchmarks: VQA-V2, GQA, TextVQA, MME, and MMBench. Results show that the strategies significantly enhance MLLMs’ instruction-following capabilities while preserving their multimodal understanding abilities.

Low Difficulty Summary (GrooveSquid.com, original content)
The paper tries to make Multimodal Large Language Models (MLLMs) better at following instructions without losing their ability to understand different types of information. Currently, MLLMs are not as good as plain language models at following instructions. To fix this, the researchers came up with two new ideas: Visual-Modality Token Compression and Cross-Modality Attention Inhibition. These methods help MLLMs follow instructions better by getting rid of extra visual information that isn’t important. The authors tested these strategies on several tasks and showed that they work well.

Keywords

  • Artificial intelligence
  • Attention
  • Clustering
  • Token