
Summary of KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, by Zirui Liu et al.


KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

by Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu

First submitted to arXiv on: 5 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG); Performance (cs.PF)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper explores the challenges of efficiently serving large language models (LLMs) and proposes a way to reduce the memory demands and computational costs of the key-value (KV) cache. The KV cache stores attention keys and values to avoid re-computation, but as batch sizes increase and context lengths grow, it becomes a bottleneck in both speed and memory usage. To address this, the authors conduct a comprehensive study of the element distribution of KV caches in popular LLMs and develop a tuning-free 2-bit KV cache quantization algorithm named KIVI. KIVI lets models maintain quality while using significantly less peak memory, supporting up to 4x larger batch sizes and yielding substantial throughput improvements.
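
To make the idea of 2-bit KV cache quantization concrete, below is a minimal round-trip sketch in PyTorch. The tensor shapes, the choice to group keys per-channel and values per-token, and the helper names quantize_2bit / dequantize_2bit are illustrative assumptions made for this summary, not the authors' implementation (which additionally involves bit packing and fused kernels).

```python
import torch

def quantize_2bit(x: torch.Tensor, dim: int):
    """Min/max (asymmetric) 2-bit quantization with a per-group scale and zero-point."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3.0  # 2 bits -> 4 quantization levels
    codes = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return codes, scale, xmin

def dequantize_2bit(codes, scale, zero_point):
    """Map 2-bit codes back to floating point."""
    return codes.to(scale.dtype) * scale + zero_point

# Toy KV cache slices: (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Illustrative asymmetric grouping: keys per-channel (statistics taken over tokens),
# values per-token (statistics taken over channels).
k_codes, k_scale, k_zero = quantize_2bit(k, dim=-2)
v_codes, v_scale, v_zero = quantize_2bit(v, dim=-1)

k_hat = dequantize_2bit(k_codes, k_scale, k_zero)
print("mean abs key reconstruction error:", (k_hat - k).abs().mean().item())
```

In a real system the 2-bit codes would also be packed four to a byte; storing them as uint8 as above demonstrates the round trip but not the memory savings.
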
Low Difficulty Summary (written by GrooveSquid.com, original content)
Large language models (LLMs) need to process many requests at once to be efficient. This is called batching. However, as the size of these batches grows, so does the amount of memory needed for something called the key-value (KV) cache. The KV cache helps by storing important information that doesn’t need to be computed again. But this means more memory is used, and it becomes a problem. To solve this, researchers studied how the KV cache works in popular LLMs. They found that some parts of the cache can be simplified without losing accuracy. This led to an easy-to-use 2-bit quantization algorithm called KIVI. It helps models use less memory while still working well.
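
For a sense of scale, here is a rough back-of-the-envelope estimate of KV cache memory at 16-bit versus 2-bit precision. The model shape (32 layers, 32 heads, head dimension 128, roughly a 7B-parameter decoder) and the batch and sequence sizes are assumptions chosen for illustration, not figures from the paper.

```python
# KV-cache memory estimate; every dimension below is an illustrative assumption.
layers, heads, head_dim = 32, 32, 128   # roughly a 7B-parameter decoder
batch, seq_len = 8, 4096                # assumed serving workload

# Keys and values: 2 tensors per layer, one element per (token, head, channel).
elements = 2 * layers * heads * head_dim * batch * seq_len

fp16_gb = elements * 2 / 1e9      # 16-bit floats: 2 bytes per element
int2_gb = elements * 0.25 / 1e9   # 2-bit codes: 1/4 byte per element (scales/zero-points ignored)

print(f"FP16 KV cache: {fp16_gb:.1f} GB  ->  2-bit KV cache: {int2_gb:.1f} GB")
```

Even ignoring quantization metadata, shrinking each cache entry from 16 bits to 2 bits cuts storage by roughly 8x, which is what lets larger batches fit in the same GPU memory.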

Keywords

* Artificial intelligence
* Attention
* Quantization