Summary of UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, by Jing Xiong et al.
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
by Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong
First submitted to arxiv on: 4 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | UNComp is an uncertainty-aware compression scheme designed to accelerate large language model (LLM) inference while minimizing performance loss. It uses matrix entropy to estimate model uncertainty across layers and heads at the token-sequence level, groups layers and heads by that uncertainty, and adaptively compresses both the hidden states and the key-value (KV) cache. This approach achieves a 1.6x speedup in the prefilling stage, reduces cache size by 95.26%, and increases inference throughput by 6.4x with only a 1.41% performance loss. In the needle-in-a-haystack task, UNComp even outperforms the full-size KV cache when the cache is compressed to 9.38% of its original size. (A rough, illustrative sketch of the matrix-entropy idea appears below this table.) |
Low | GrooveSquid.com (original content) | UNComp is a new way to make large language models run faster and more efficiently. It uses a mathematical measure of how unsure the model is about different parts of its input, and then groups those parts together based on that uncertainty. Information that matters less can be compressed more, so the model runs faster without losing much accuracy. In tests, UNComp sped up prefilling by about 60%, greatly reduced memory use, and on some tasks even did better than the uncompressed model. This could be very useful for people who need to run large language models on computers or servers with limited resources. |
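To give a more concrete feel for the "matrix entropy" mentioned in the medium summary, here is a minimal, hypothetical sketch. It assumes the uncertainty score for one attention head is the entropy of the normalized eigenvalue spectrum of the covariance (Gram) matrix of that head's token representations; the function name, shapes, and the grouping comment are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def matrix_entropy(hidden_states: np.ndarray) -> float:
    """Entropy of the eigenvalue spectrum of a token-representation matrix.

    hidden_states: (seq_len, head_dim) activations for one head at one layer
    (shape is an assumption for illustration). Higher entropy means the
    information is spread across more directions, i.e. more uncertainty.
    """
    # Covariance/Gram matrix over the hidden dimension.
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    # Eigenvalues of a symmetric PSD matrix; clip tiny negatives from rounding.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    # Normalize so the eigenvalues form a probability distribution.
    probs = eigvals / eigvals.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

# Example: score a head's representations. In an UNComp-style scheme, heads or
# layers would be grouped by such scores, with lower-entropy (more compressible)
# groups given smaller hidden-state / KV-cache budgets.
rng = np.random.default_rng(0)
states = rng.standard_normal((128, 64))  # 128 tokens, 64-dim head
print(matrix_entropy(states))
```

The intuition this sketch tries to capture: heads and layers whose token representations concentrate in a few directions (low entropy) can tolerate more aggressive compression, while high-entropy ones keep a larger share of the cache.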
Keywords
» Artificial intelligence » Inference » Large language model » Token