Summary of MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection, by Bokai Lin et al.


MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

by Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng

First submitted to arXiv on: 16 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper improves the inference efficiency of large language models (LLMs) by compressing the key-value (KV) cache, which becomes a memory bottleneck as model sizes and context lengths grow. Previous studies have mostly compressed the cache along its first three axes (layer number, head number, and sequence length); this work supplements those efforts by targeting the feature-dimension axis, reducing it with low-rank projection matrices. The authors first investigate the canonical orthogonal projection given by principal component analysis (PCA) but observe significant performance degradation at low compression rates. To address this, they directly tune the orthogonal projection matrices with a distillation objective, using an elaborate Matryoshka training strategy that allows adaptive search for the optimal compression rate of each layer and head under a given budget. The method retains over 90% of the original performance at an average KV cache compression rate of 60% (up to 75% in certain extreme scenarios) for popular LLMs such as LLaMA2-7B-base and Mistral-7B-v0.3-base. (A minimal code sketch of the feature-axis projection idea appears after the summaries below.)
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about making language models more efficient so they can process lots of data quickly. Right now, these models need a lot of space to store information from previous calculations, which can slow them down. Researchers have been trying to fix this by compressing the model’s memory, but most efforts have focused on the first three parts of that memory. This paper takes a different approach and shrinks the remaining part, so the same information fits in less space. The authors use techniques called principal component analysis (PCA) and distillation to make sure the model still works well even with less storage space.
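
To make the feature-axis compression concrete, below is a minimal sketch (not the authors' implementation) of the PCA-style orthogonal projection the paper starts from: one head's cached keys or values are projected onto the top principal directions, and because prefixes of a single orthogonal basis serve every rank, the retained rank can be chosen per layer and head under a given budget. All names (head_dim, num_tokens, compress, decompress) and the random stand-in data are illustrative assumptions; the distillation-based Matryoshka tuning described in the paper is omitted.

```python
# Illustrative sketch only: PCA-style orthogonal projection of one head's
# cached keys/values along the feature dimension, read back at several ranks.
import torch

torch.manual_seed(0)

head_dim = 64                            # per-head feature dimension (assumed)
num_tokens = 512                         # cached sequence length (assumed)
kv = torch.randn(num_tokens, head_dim)   # random stand-in for a real KV cache

# Orthogonal basis from the feature second-moment matrix (uncentered PCA),
# with columns ordered by explained variance.
second_moment = kv.T @ kv / num_tokens
eigvals, basis = torch.linalg.eigh(second_moment)      # ascending eigenvalues
basis = basis[:, eigvals.argsort(descending=True)]     # reorder to descending

def compress(x: torch.Tensor, rank: int) -> torch.Tensor:
    """Store only the top-`rank` projection coefficients."""
    return x @ basis[:, :rank]                          # (num_tokens, rank)

def decompress(z: torch.Tensor, rank: int) -> torch.Tensor:
    """Map the low-rank cache back to the full feature dimension."""
    return z @ basis[:, :rank].T                        # (num_tokens, head_dim)

# "Matryoshka" reading: prefixes of the same basis realize every compression
# rate, so different layers/heads can use different ranks under one budget.
for rank in (16, 32, 48, 64):
    recon = decompress(compress(kv, rank), rank)
    rel_err = (recon - kv).norm() / kv.norm()
    print(f"rank {rank:2d}: relative reconstruction error {rel_err:.3f}")
```

In the paper, these projections are not left at their PCA initialization; they are further trained with a distillation objective so that accuracy holds up at much more aggressive compression rates.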

Keywords

» Artificial intelligence  » Distillation  » Inference  » PCA  » Principal component analysis