Summary of In-context KV-Cache Eviction for LLMs via Attention-Gate, by Zihao Zeng et al.
In-context KV-Cache Eviction for LLMs via Attention-Gate
by Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng
First submitted to arXiv on: 15 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The KV-Cache technique has become the standard for large language model inference, caching key-value states to avoid recomputation. However, it is increasingly criticized as a memory bottleneck, especially for ultra-large models and long-context queries. This paper addresses the issue by devising Attention-Gate, a parameterized mechanism that takes the context as input and yields an eviction flag for each token. The subsequent self-attention module proceeds according to these flags, caching KV states only for the retained tokens (a rough code sketch of this idea follows the table). Attention-Gates can vary across heads and layers and can be plugged into pre-trained LLMs through continual pre-training or supervised fine-tuning objectives. Validation across multiple tasks demonstrates efficiency and adaptability, outperforming LoRA-finetuned LLMs on some datasets. |
Low | GrooveSquid.com (original content) | Large language models (LLMs) have become incredibly powerful tools for processing natural language data. However, they can be slow to use because they need to remember every piece of information they have seen so far. This paper shows how to make LLMs faster by throwing away the things they don't need. It's like a garbage collector for your computer, but instead of deleting files and programs, it gets rid of the things the LLM doesn't really care about. The new method is called Attention-Gate, and it works by looking at the whole sentence or paragraph to decide what's important and what can be forgotten. |
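To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described in the medium summary: a small gate scores each token from its contextual hidden state and emits a keep/evict flag, and the attention layer masks out evicted tokens and caches K/V only for the kept ones. This is not the authors' implementation; the module names (`AttentionGate`, `GatedSelfAttention`), the hard 0.5 threshold, the single attention head, and the omission of a causal mask are all simplifying assumptions.

```python
# Minimal sketch of in-context KV-cache eviction via a per-token gate.
# NOT the paper's implementation: module names, the hard 0.5 threshold,
# the single attention head, and the lack of a causal mask are
# simplifying assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Maps each token's contextual hidden state to a keep/evict flag."""

    def __init__(self, hidden_dim: int, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # would be learned via continual pre-training / SFT
        self.threshold = threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        keep_prob = torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)
        keep = keep_prob > self.threshold  # True = retain this token's KV
        keep[:, -1] = True                 # safeguard: never evict the latest token
        return keep


class GatedSelfAttention(nn.Module):
    """Single-head self-attention that only caches K/V for tokens the gate keeps."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        self.k_proj = nn.Linear(hidden_dim, hidden_dim)
        self.v_proj = nn.Linear(hidden_dim, hidden_dim)
        self.gate = AttentionGate(hidden_dim)

    def forward(self, hidden_states: torch.Tensor):
        q = self.q_proj(hidden_states)
        k = self.k_proj(hidden_states)
        v = self.v_proj(hidden_states)
        keep = self.gate(hidden_states)                        # (batch, seq_len) bool

        # Evicted tokens are masked out so no query can attend to them.
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # (batch, seq_len, seq_len)
        scores = scores.masked_fill(~keep.unsqueeze(1), float("-inf"))
        out = F.softmax(scores, dim=-1) @ v

        # Only retained tokens' K/V would be written to the cache,
        # shrinking memory for later decoding steps.
        kv_cache = [(k[b][keep[b]], v[b][keep[b]]) for b in range(k.size(0))]
        return out, kv_cache


# Toy usage: batch of 2 sequences, 16 tokens each, hidden size 64.
x = torch.randn(2, 16, 64)
attn = GatedSelfAttention(64)
out, cache = attn(x)
print(out.shape, [cached_k.shape for cached_k, _ in cache])
```

In the paper the gates can differ across heads and layers and are learned through continual pre-training or supervised fine-tuning; a hard boolean flag like the one above is not differentiable on its own, so training would need some relaxation, which this sketch omits.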
Keywords
» Artificial intelligence » Attention » Fine tuning » Inference » Large language model » Lora » Self attention » Supervised » Token