Summary of UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, by Jing Xiong et al.
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
by Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong
First submitted to arxiv on: 4 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | UNComp is an uncertainty-aware compression scheme designed to accelerate large language model (LLM) inference while minimizing performance loss. It uses matrix entropy to estimate model uncertainty across layers and heads at the token-sequence level, groups layers and heads by that uncertainty, and adaptively compresses both the hidden states and the key-value (KV) cache. This approach achieves a 1.6x speedup in the prefilling stage, reduces cache size by 95.26%, and increases inference throughput by 6.4x with only a 1.41% performance loss. In the needle-in-a-haystack task, UNComp even outperforms the full-size KV cache when the cache is compressed to 9.38% of its original size. (A rough, illustrative sketch of the matrix-entropy idea appears below this table.) |
Low | GrooveSquid.com (original content) | UNComp is a new way to make large language models run faster and more efficiently. It uses a mathematical measure of how unsure the model is about different parts of its input, and then groups those parts together based on that uncertainty. Information that matters less can be compressed more, so the model runs faster without losing much accuracy. In tests, UNComp sped up prefilling by about 60%, greatly reduced memory use, and on some tasks even did better than the uncompressed model. This could be very useful for people who need to run large language models on computers or servers with limited resources. |
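To give a more concrete feel for the "matrix entropy" mentioned in the medium summary, here is a minimal, hypothetical sketch. It assumes the uncertainty score for one attention head is the entropy of the normalized eigenvalue spectrum of the covariance (Gram) matrix of that head's token representations; the function name, shapes, and the grouping comment are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def matrix_entropy(hidden_states: np.ndarray) -> float:
    """Entropy of the eigenvalue spectrum of a token-representation matrix.

    hidden_states: (seq_len, head_dim) activations for one head at one layer
    (shape is an assumption for illustration). Higher entropy means the
    information is spread across more directions, i.e. more uncertainty.
    """
    # Covariance/Gram matrix over the hidden dimension.
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    # Eigenvalues of a symmetric PSD matrix; clip tiny negatives from rounding.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    # Normalize so the eigenvalues form a probability distribution.
    probs = eigvals / eigvals.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

# Example: score a head's representations. In an UNComp-style scheme, heads or
# layers would be grouped by such scores, with lower-entropy (more compressible)
# groups given smaller hidden-state / KV-cache budgets.
rng = np.random.default_rng(0)
states = rng.standard_normal((128, 64))  # 128 tokens, 64-dim head
print(matrix_entropy(states))
```

The intuition this sketch tries to capture: heads and layers whose token representations concentrate in a few directions (low entropy) can tolerate more aggressive compression, while high-entropy ones keep a larger share of the cache.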
Keywords
» Artificial intelligence » Inference » Large language model » Token