Summary of Hired: Attention-guided Token Dropping For Efficient Inference Of High-resolution Vision-language Models, by Kazi Hasan Ibn Arif et al.
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
by Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
First submitted to arxiv on: 20 Aug 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens due to the need to encode multiple partitions of a high-resolution image input. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED leverages the attention of CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget for each partition accordingly. The most informative visual tokens from each partition within the allocated budget are then selected and passed to the subsequent Large Language Model (LLM). We showed that HiRED achieves superior accuracy and performance, compared to existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary High-resolution Vision-Language Models are used to improve accuracy in multimodal tasks. However, these models can generate too many visual tokens, making them hard to process on regular computers. To solve this problem, we created a method called High-Resolution Early Dropping (HiRED). HiRED is a way to reduce the number of visual tokens without sacrificing performance. We showed that HiRED works better than other methods and can speed up processing by 4.7 times while using less memory. |
Keywords
» Artificial intelligence » Attention » Inference » Large language model » Token » Vision transformer » Vit