Summary of VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration, by Dezhan Tu et al.
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu
First submitted to arXiv on: 29 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Vision-Language Models (VLMs) have shown impressive performance across various tasks. A key challenge in accelerating VLM inference is storing and accessing the large Key-Value (KV) caches that encode visual contexts such as images or videos. Existing KV cache compression methods are effective for Large Language Models (LLMs), but directly migrating them to VLMs yields suboptimal accuracy and speedup. This paper proposes VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. The authors investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in the prefill and decoding phases. They introduce layer-adaptive, sparsity-aware cache budget allocation to distribute the limited cache budget across layers, reducing KV cache size without compromising accuracy. Additionally, they develop a modality-aware token scoring policy to evaluate token importance. Experimental results on multiple benchmark datasets demonstrate that retaining only 10% of the KV cache achieves accuracy comparable to the full cache. The proposed method reduces end-to-end latency by up to 2.33x, speeds up decoding by up to 7.08x, and cuts the GPU memory footprint of the KV cache by 90%. (A rough illustrative code sketch of the budget-allocation and token-scoring ideas follows the table.) |
Low | GrooveSquid.com (original content) | Imagine trying to understand a long video or image by looking at small pieces of it. This is like what computers do when they process visual information. The problem is that the computer needs a lot of memory to store these small pieces, called “key-value” caches. Researchers have been working on ways to make this more efficient. In this paper, scientists propose a new way to compress these key-value caches so that computers can process images and videos faster and use less memory. They did this by looking at how the computer attends to different parts of an image or video and then allocating the limited memory space accordingly. The method was tested on several datasets and achieved similar accuracy while using much less memory. |
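Because the medium-difficulty summary names two concrete mechanisms, layer-adaptive sparsity-aware budget allocation and modality-aware token scoring, a small sketch can make the general idea concrete. The snippet below is an illustration based only on the summary's description, not the paper's actual algorithm; all function names, weighting choices, and parameters (e.g. `visual_weight`, `text_weight`) are hypothetical assumptions.

```python
# Minimal NumPy sketch of the two ideas described in the summary:
# (1) split a global KV cache budget across layers according to how sparse
#     each layer's attention is, and (2) score cached tokens with a
#     modality-dependent weight before pruning. Hypothetical, not the paper's code.
import numpy as np

def layer_budgets(attn_sparsity_per_layer, total_budget_tokens):
    """Give denser (less sparse) layers a larger share of the token budget."""
    density = 1.0 - np.asarray(attn_sparsity_per_layer, dtype=float)
    weights = density / density.sum()
    # Budgets are rounded per layer, so they only approximately sum to the total.
    return np.maximum(1, np.round(weights * total_budget_tokens)).astype(int)

def score_tokens(attn_weights, is_visual, visual_weight=1.0, text_weight=1.0):
    """Score each cached token by the attention it receives, scaled by modality."""
    # attn_weights: (num_queries, num_cached_tokens) attention matrix for one layer.
    received = attn_weights.sum(axis=0)
    modality_scale = np.where(is_visual, visual_weight, text_weight)
    return received * modality_scale

def prune_kv_cache(keys, values, scores, budget):
    """Keep only the top-`budget` tokens by score, preserving their order."""
    keep = np.sort(np.argsort(scores)[-budget:])
    return keys[keep], values[keep]

# Toy usage: one layer with 8 cached tokens (5 visual, 3 text), budget of 3.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
attn = rng.random(size=(2, 8))
is_visual = np.array([True, True, True, True, True, False, False, False])
scores = score_tokens(attn, is_visual, visual_weight=0.8, text_weight=1.2)
kept_keys, kept_values = prune_kv_cache(keys, values, scores, budget=3)
print(kept_keys.shape)  # (3, 4)
```

The paper reports keeping roughly 10% of the KV cache with comparable accuracy; the exact scoring and allocation rules it uses may differ from the simple heuristics sketched above.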
Keywords
» Artificial intelligence » Attention » Inference » Token