
Summary of CORM: Cache Optimization with Recent Message for Large Language Model Inference, by Jincheng Dai et al.


CORM: Cache Optimization with Recent Message for Large Language Model Inference

by Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi

First submitted to arXiv on: 24 Apr 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large Language Models (LLMs) have achieved impressive performance on various tasks, but they require significant computational resources and GPU memory. The KV cache, which stores the key-value pairs used by attention, consumes a substantial portion of this memory, particularly when processing long sequences. To address this issue, the authors propose CORM, a method that optimizes the KV cache and reduces its memory footprint by up to 70% while maintaining performance across six tasks in LongBench. By exploiting the similarity between adjacent tokens’ query vectors and the attention scores computed by preceding queries, CORM dynamically retains only the essential key-value pairs during inference, without requiring model fine-tuning. The approach is also compatible with grouped-query attention (GQA) for a further reduction in memory.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about finding a way to make large language models use less memory on computers. These models are good at doing tasks like understanding language, but they take up a lot of space and energy. The problem is that the model’s “cache” (a kind of temporary storage) uses up too much memory when processing long texts. The researchers came up with an innovative solution called CORM that helps reduce this memory usage without sacrificing performance. They found that similar patterns exist in the data, which allows them to keep only the most important information. This approach works well on six different tasks and can even be used with another technique for further compression.

Keywords

» Artificial intelligence  » Attention  » Fine tuning  » Inference