An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
by Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
First submitted to arXiv on: 11 Mar 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The study identifies an inefficient attention phenomenon in Large Vision-Language Models (LVLMs) such as LLaVA-1.5, QwenVL-Chat, and Video-LLaVA: in deep layers, attention computation over visual tokens is highly inefficient, suggesting visual input warrants a sparser treatment than text. The authors introduce FastV, a plug-and-play method that improves computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in later ones. Evaluations show that FastV dramatically reduces computational cost (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance on image and video understanding tasks; a minimal code sketch of the pruning idea follows this table. |
| Low | GrooveSquid.com (original content) | The study looks at why some big computer models are wasting energy when they process images. It finds that these models, like LLaVA-1.5, are not using their attention powers efficiently. Attention is important because it helps the model understand what’s important in an image. The researchers created a new way to make these models more efficient called FastV. This method makes the models work faster and use less energy without losing their ability to understand images. This could help make these powerful models available on devices like phones or tablets. |
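To make the pruning idea concrete, here is a minimal PyTorch sketch of attention-ranked visual token pruning: after an early layer (layer 2 in the paper’s title), visual tokens are scored by the average attention they receive, and the least-attended half is dropped. The function name, tensor shapes, and arguments (`img_start`, `img_len`, `keep_ratio`) are illustrative assumptions for this sketch, not the authors’ released implementation.

```python
# Minimal sketch of FastV-style visual token pruning, assuming standard
# decoder hidden states and attention maps. Not the authors' implementation.
import torch


def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        img_start: int,
                        img_len: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Drop the least-attended visual tokens after an early layer K.

    hidden_states: (batch, seq_len, dim) activations leaving layer K.
    attn_weights:  (batch, heads, seq_len, seq_len) attention from layer K.
    img_start, img_len: where the visual tokens sit in the sequence.
    keep_ratio: fraction of visual tokens to keep (0.5 = "1/2 tokens").
    """
    # Score each visual token by the attention it *receives*, averaged
    # over heads (dim 1) and then over query positions.
    scores = attn_weights.mean(dim=1).mean(dim=1)          # (batch, seq_len)
    img_scores = scores[:, img_start:img_start + img_len]  # (batch, img_len)

    k = max(1, int(img_len * keep_ratio))
    # Keep the top-k visual tokens, restoring their original order.
    keep_idx = img_scores.topk(k, dim=-1).indices.sort(dim=-1).values

    pruned = []
    for b in range(hidden_states.size(0)):
        visual_kept = hidden_states[b, img_start + keep_idx[b]]
        pruned.append(torch.cat([hidden_states[b, :img_start],
                                 visual_kept,
                                 hidden_states[b, img_start + img_len:]]))
    return torch.stack(pruned)  # (batch, shorter seq_len, dim)


# Toy usage: 8 text tokens, then 16 visual tokens, then 4 more text tokens.
x = torch.randn(1, 28, 64)
attn = torch.softmax(torch.randn(1, 4, 28, 28), dim=-1)
out = prune_visual_tokens(x, attn, img_start=8, img_len=16)
print(out.shape)  # torch.Size([1, 20, 64]) -- half the visual tokens dropped
```

Because the surviving sequence is shorter, the attention and MLP cost of every subsequent layer (and the KV cache) shrinks, which is roughly where the reported FLOPs savings come from.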
Keywords
» Artificial intelligence » Attention » Pruning