
Summary of PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization, by Mengzhao Chen et al.


PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

by Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

First submitted to arXiv on: 7 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Existing weight-activation quantization methods for Large Language Models (LLMs) focus mainly on channel-wise outliers and neglect token-wise outliers, which degrade the accuracy of quantized models. This paper proposes PrefixQuant, a method that isolates token-wise outliers by prefixing the outlier tokens in the KV cache, a step that is training-free and efficient, and that achieves state-of-the-art performance across various precision levels and granularities. PrefixQuant also introduces new trainable parameters for block-wise training to compensate for quantization error. Experiments show that PrefixQuant outperforms existing dynamic quantization methods under both dynamic and static quantization settings; for instance, it achieves an average accuracy improvement of +3.08 points over SpinQuant on five zero-shot reasoning tasks under dynamic quantization. PrefixQuant also delivers up to 2.74x prefilling speedup and 2.16x decoding speedup for W4A4 LLMs. The code is available at this GitHub link. A minimal code sketch of the prefixing idea is shown after the summaries below.
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about making computers understand language better. Right now, there are some methods that help computers do this, but they don’t work well when the computer has to make quick decisions. This new method, called PrefixQuant, can make these decisions faster and more accurately. It does this by spotting the few unusual, extreme values in how the computer stores information and setting them aside so they cause fewer errors. The results show that PrefixQuant works better than other methods, especially when the computer is working quickly. This could help computers understand language even better, which could be very useful for things like chatbots or virtual assistants.
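
The core idea described in the medium summary, finding the handful of tokens whose activations are extreme and moving them into a fixed prefix whose KV cache is computed once, can be illustrated with a short sketch. The snippet below is a hypothetical illustration, not the authors' released code: the function names, the `ratio` threshold, and the `max_prefix` cap are assumptions chosen for clarity, and real calibration activations would replace the random tensors.

```python
# Hypothetical sketch of the prefixed-token idea (not the PrefixQuant code).
# Assumptions: `hidden_states` is a (seq_len, hidden_dim) activation tensor
# captured from a calibration forward pass, and `token_ids` holds the
# corresponding input token ids.
import torch

def find_outlier_token_ids(hidden_states, token_ids, ratio=64.0, max_prefix=4):
    """Pick token ids whose peak activation dwarfs the typical token.

    `ratio` and `max_prefix` are illustrative hyper-parameters, not values
    taken from the paper.
    """
    per_token_peak = hidden_states.abs().amax(dim=-1)            # (seq_len,)
    typical = per_token_peak.median()
    outlier_pos = torch.nonzero(per_token_peak > ratio * typical).flatten()
    # Keep the most extreme positions, then return their token ids, deduplicated.
    order = per_token_peak[outlier_pos].argsort(descending=True)
    chosen = outlier_pos[order][:max_prefix]
    return torch.unique(token_ids[chosen]).tolist()

def build_prefixed_input(prefix_ids, prompt_ids):
    """Prepend the outlier tokens so their extreme activations live in the
    precomputed KV cache instead of appearing mid-sequence."""
    prefix = torch.tensor(prefix_ids, dtype=prompt_ids.dtype)
    return torch.cat([prefix, prompt_ids])

# Toy usage with random data standing in for real calibration activations.
hidden = torch.randn(16, 128)
hidden[3] *= 1000.0                      # simulate one token-wise outlier
tokens = torch.arange(100, 116)
prefix = find_outlier_token_ids(hidden, tokens)
new_input = build_prefixed_input(prefix, tokens)
```

Under this reading, once such a prefix is fixed and cached, the remaining sequence no longer carries token-wise spikes, which is what makes coarse, static (e.g., per-tensor) quantization of activations viable.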

Keywords

» Artificial intelligence  » Precision  » Quantization  » Token  » Zero shot