
Summary of SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, by Zihao Wang et al.


SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

by Zihao Wang, Bin Cui, Shaoduo Gan

First submitted to arXiv on: 7 Apr 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper proposes SqueezeAttention, a novel approach to optimizing the Key-Value (KV) cache of Large Language Models (LLMs). Existing KV-cache compression algorithms treat all layers equally, allocating the same budget to each layer. This is suboptimal, because some layers are much less sensitive to the input tokens than others, yet still receive the same budget. SqueezeAttention therefore manages the KV-cache jointly along two dimensions: the sequence dimension and the layer dimension. It measures each layer's importance by the cosine similarity of the input prompt's embedding before and after the self-attention layer, assigns every layer its own cache budget accordingly, and then applies three representative sequence-wise compression algorithms within each layer's budget. By optimizing the KV-cache in both dimensions, SqueezeAttention achieves memory reductions of 30% to 70% and throughput improvements of up to 2.2x across a range of LLMs and benchmarks.
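
To make the layer-wise budgeting idea concrete, here is a minimal sketch of one way it could look in code: score each layer by how much self-attention changes the prompt embedding (via cosine similarity), split a global KV-cache budget across layers in proportion to those scores, and hand each layer's budget to a sequence-wise eviction routine. The function names, the proportional-allocation heuristic, and the recency-based eviction stub are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of layer-wise KV-cache budgeting (names and heuristics
# are illustrative assumptions, not the authors' implementation).
import numpy as np

def layer_importance(hidden_in: np.ndarray, hidden_out: np.ndarray) -> float:
    """Score a layer by how much self-attention changes the prompt embedding.

    A cosine similarity close to 1 means the layer barely alters its input,
    suggesting it can tolerate a smaller KV-cache budget.
    """
    a, b = hidden_in.ravel(), hidden_out.ravel()
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return 1.0 - cos  # higher score = more important layer

def allocate_budgets(importances, total_budget):
    """Split a global KV-cache budget across layers in proportion to importance."""
    imps = np.asarray(importances, dtype=np.float64)
    shares = imps / imps.sum()
    return [max(1, int(round(total_budget * s))) for s in shares]

def compress_layer_cache(keys, values, budget):
    """Sequence-wise compression stub: keep only the most recent `budget` tokens.

    Any sequence-wise eviction policy (e.g. a heavy-hitter or sliding-window
    style method) could be plugged in here with the layer's own budget.
    """
    return keys[-budget:], values[-budget:]

# Toy usage: 4 layers, a global budget of 1024 cached tokens.
rng = np.random.default_rng(0)
pre = [rng.normal(size=(16, 64)) for _ in range(4)]   # embeddings before attention
post = [h + rng.normal(scale=s, size=h.shape)          # embeddings after attention
        for h, s in zip(pre, [0.1, 0.5, 0.05, 0.8])]
imps = [layer_importance(a, b) for a, b in zip(pre, post)]
print(allocate_budgets(imps, total_budget=1024))       # more tokens go to "changed" layers
```

In this toy run, layers whose output diverges more from their input receive a larger slice of the shared budget, which is the intuition behind optimizing the cache layer-wise in addition to sequence-wise.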

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about making language models more efficient by reducing the amount of memory they use while generating text. The authors found that some layers of the model matter less than others, so they developed a way to give those layers a smaller share of the memory. Their approach, called SqueezeAttention, cuts memory usage and speeds the model up, which means language models can run in more places and process text faster.

Keywords

  • Artificial intelligence
  • Attention
  • Cosine similarity
  • Prompt
  • Self-attention