Summary of In-context KV-Cache Eviction for LLMs via Attention-Gate, by Zihao Zeng et al.
In-context KV-Cache Eviction for LLMs via Attention-Gate
by Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng
First submitted to arXiv on: 15 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The KV-Cache technique has become the standard for large language model inference, caching key-value states to avoid recomputation. However, it is increasingly criticized as a memory bottleneck, especially for ultra-large models and long-context queries. This paper addresses the issue by devising Attention-Gate, a parameterized mechanism that takes the context as input and yields an eviction flag for each token. The subsequent self-attention module proceeds according to these flags, caching KV states only for the retained tokens (a rough code sketch of this idea follows the table). Attention-Gates can vary across heads and layers and can be plugged into pre-trained LLMs through continual pre-training or supervised fine-tuning objectives. Validation across multiple tasks demonstrates efficiency and adaptability, outperforming LoRA-finetuned LLMs on some datasets. |
Low | GrooveSquid.com (original content) | Large language models (LLMs) have become incredibly powerful tools for processing natural language data. However, they can be slow to use because they need to remember every piece of information they have seen so far. This paper shows how to make LLMs faster by throwing away the things they don't need. It's like a garbage collector for your computer, but instead of deleting files and programs, it gets rid of the things the LLM doesn't really care about. The new method is called Attention-Gate, and it works by looking at the whole sentence or paragraph to decide what's important and what can be forgotten. |
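To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described in the medium summary: a small gate scores each token from its contextual hidden state and emits a keep/evict flag, and the attention layer masks out evicted tokens and caches K/V only for the kept ones. This is not the authors' implementation; the module names (`AttentionGate`, `GatedSelfAttention`), the hard 0.5 threshold, the single attention head, and the omission of a causal mask are all simplifying assumptions.

```python
# Minimal sketch of in-context KV-cache eviction via a per-token gate.
# NOT the paper's implementation: module names, the hard 0.5 threshold,
# the single attention head, and the lack of a causal mask are
# simplifying assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Maps each token's contextual hidden state to a keep/evict flag."""

    def __init__(self, hidden_dim: int, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # would be learned via continual pre-training / SFT
        self.threshold = threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        keep_prob = torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)
        keep = keep_prob > self.threshold  # True = retain this token's KV
        keep[:, -1] = True                 # safeguard: never evict the latest token
        return keep


class GatedSelfAttention(nn.Module):
    """Single-head self-attention that only caches K/V for tokens the gate keeps."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        self.k_proj = nn.Linear(hidden_dim, hidden_dim)
        self.v_proj = nn.Linear(hidden_dim, hidden_dim)
        self.gate = AttentionGate(hidden_dim)

    def forward(self, hidden_states: torch.Tensor):
        q = self.q_proj(hidden_states)
        k = self.k_proj(hidden_states)
        v = self.v_proj(hidden_states)
        keep = self.gate(hidden_states)                        # (batch, seq_len) bool

        # Evicted tokens are masked out so no query can attend to them.
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # (batch, seq_len, seq_len)
        scores = scores.masked_fill(~keep.unsqueeze(1), float("-inf"))
        out = F.softmax(scores, dim=-1) @ v

        # Only retained tokens' K/V would be written to the cache,
        # shrinking memory for later decoding steps.
        kv_cache = [(k[b][keep[b]], v[b][keep[b]]) for b in range(k.size(0))]
        return out, kv_cache


# Toy usage: batch of 2 sequences, 16 tokens each, hidden size 64.
x = torch.randn(2, 16, 64)
attn = GatedSelfAttention(64)
out, cache = attn(x)
print(out.shape, [cached_k.shape for cached_k, _ in cache])
```

In the paper the gates can differ across heads and layers and are learned through continual pre-training or supervised fine-tuning; a hard boolean flag like the one above is not differentiable on its own, so training would need some relaxation, which this sketch omits.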
Keywords
» Artificial intelligence » Attention » Fine tuning » Inference » Large language model » Lora » Self attention » Supervised » Token