
Summary of UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, by Jing Xiong et al.


UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

by Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong

First submitted to arXiv on: 4 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract; see the arXiv listing above.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed uncertainty-aware compression scheme, UNComp, accelerates large language model (LLM) inference while minimizing performance loss. It uses matrix entropy to estimate model uncertainty across layers and heads at the token-sequence level, groups them by that uncertainty, and adaptively compresses both the hidden states and the key-value (KV) cache. This approach achieves a 1.6x prefilling speedup, reduces the KV cache size by 95.26%, and improves inference throughput by 6.4x with only a 1.41% performance loss. Moreover, in needle-in-a-haystack tasks, UNComp outperforms the full-size cache even when the cache is compressed to 9.38% of its original size. (A minimal code sketch of the entropy-and-grouping idea follows the summaries below.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
UNComp is a new way to make large language models run faster and more efficiently. It uses a mathematical measure of how unsure the model is about different parts of its computation, and then groups those parts together based on that uncertainty. This helps compress the information that isn't as important, so the model runs faster without losing much accuracy. In tests, UNComp sped up prefilling by 60%, cut memory usage substantially, and even did better than the uncompressed model on some tasks. This could be very useful for people who need to run large language models on computers or servers with limited resources.

Keywords

» Artificial intelligence  » Inference  » Large language model  » Token