Summary of MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection, by Bokai Lin et al.


MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

by Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng

First submitted to arXiv on: 16 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper improves the inference efficiency of large language models (LLMs) by compressing the key-value (KV) cache, which becomes a memory bottleneck as model sizes and context lengths grow. Previous studies have mostly compressed the cache along its first three axes (layer number, head number, and sequence length); this work supplements those efforts by targeting the feature-dimension axis, reducing it with low-rank projection matrices. The authors first investigate the canonical orthogonal projection given by principal component analysis (PCA) but observe significant performance degradation at low compression rates. To address this, they directly tune the orthogonal projection matrices with a distillation objective, using an elaborate Matryoshka training strategy that allows adaptive search for the optimal compression rate of each layer and head under a given budget. The method retains over 90% of the original performance at an average KV cache compression rate of 60% (up to 75% in certain extreme scenarios) for popular LLMs such as LLaMA2-7B-base and Mistral-7B-v0.3-base. (A minimal code sketch of the feature-axis projection idea appears after the summaries below.)
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about making language models more efficient so they can process lots of data quickly. Right now, these models need a lot of space to store information from previous calculations, which can slow them down. Researchers have been trying to fix this by compressing the model’s memory, but most efforts have focused on the first three parts of that memory. This paper takes a different approach and shrinks the remaining part, so the same information fits in less space. The authors use techniques called principal component analysis (PCA) and distillation to make sure the model still works well even with less storage space.
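
To make the feature-axis compression concrete, below is a minimal sketch (not the authors' implementation) of the PCA-style orthogonal projection the paper starts from: one head's cached keys or values are projected onto the top principal directions, and because prefixes of a single orthogonal basis serve every rank, the retained rank can be chosen per layer and head under a given budget. All names (head_dim, num_tokens, compress, decompress) and the random stand-in data are illustrative assumptions; the distillation-based Matryoshka tuning described in the paper is omitted.

```python
# Illustrative sketch only: PCA-style orthogonal projection of one head's
# cached keys/values along the feature dimension, read back at several ranks.
import torch

torch.manual_seed(0)

head_dim = 64                            # per-head feature dimension (assumed)
num_tokens = 512                         # cached sequence length (assumed)
kv = torch.randn(num_tokens, head_dim)   # random stand-in for a real KV cache

# Orthogonal basis from the feature second-moment matrix (uncentered PCA),
# with columns ordered by explained variance.
second_moment = kv.T @ kv / num_tokens
eigvals, basis = torch.linalg.eigh(second_moment)      # ascending eigenvalues
basis = basis[:, eigvals.argsort(descending=True)]     # reorder to descending

def compress(x: torch.Tensor, rank: int) -> torch.Tensor:
    """Store only the top-`rank` projection coefficients."""
    return x @ basis[:, :rank]                          # (num_tokens, rank)

def decompress(z: torch.Tensor, rank: int) -> torch.Tensor:
    """Map the low-rank cache back to the full feature dimension."""
    return z @ basis[:, :rank].T                        # (num_tokens, head_dim)

# "Matryoshka" reading: prefixes of the same basis realize every compression
# rate, so different layers/heads can use different ranks under one budget.
for rank in (16, 32, 48, 64):
    recon = decompress(compress(kv, rank), rank)
    rel_err = (recon - kv).norm() / kv.norm()
    print(f"rank {rank:2d}: relative reconstruction error {rel_err:.3f}")
```

In the paper, these projections are not left at their PCA initialization; they are further trained with a distillation objective so that accuracy holds up at much more aggressive compression rates.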

Keywords

» Artificial intelligence  » Distillation  » Inference  » PCA  » Principal component analysis