
Summary of ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals, by Utkarsh Saxena et al.


ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

by Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang

First submitted to arXiv on: 18 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
In this paper, the authors propose ResQ, a method for post-training quantization (PTQ) of large language models (LLMs) that reduces computational cost at inference time without compromising generalizability. The central PTQ challenge is to quantize all weight, activation, and key-value cache tensors to 4-bit while mitigating the high quantization error caused by extreme outliers in the activations. To address this, ResQ identifies a low-rank subspace using principal component analysis (PCA) and keeps coefficients within this subspace in high precision (8-bit), while quantizing the rest to 4-bit. An invariance-preserving random rotation is applied within each subspace to further suppress outliers. The authors demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on various benchmarks, achieving up to 33% lower perplexity on Wikitext than SpinQuant and a 3x speedup over the 16-bit baseline.
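
To make the projection-and-split idea above concrete, here is a minimal NumPy sketch (not the authors' implementation): it finds principal directions from a calibration activation matrix, keeps coefficients along the top few directions in 8-bit, and fake-quantizes the remaining coefficients to 4-bit. The names resq_style_quantize and fake_quantize, the toy rank of 8, and the simple per-tensor symmetric quantizer are illustrative assumptions, and the random rotation within each subspace is omitted for brevity.

# Minimal sketch of the ResQ idea described above (not the authors' code).
import numpy as np

def fake_quantize(x, bits):
    # Symmetric uniform fake quantization: quantize, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def resq_style_quantize(X, rank=8, high_bits=8, low_bits=4):
    # PCA basis via SVD of the mean-centered calibration activations.
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T                          # columns = principal directions
    coeffs = X @ V                    # coefficients in the rotated basis
    # Mixed precision: high-precision subspace, low-precision residual.
    coeffs[:, :rank] = fake_quantize(coeffs[:, :rank], high_bits)
    coeffs[:, rank:] = fake_quantize(coeffs[:, rank:], low_bits)
    return coeffs @ V.T               # map back to the original basis

# Toy usage: 4096 "tokens", hidden size 64, a few exaggerated outlier channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 64))
X[:, :4] *= 20.0
X_hat = resq_style_quantize(X, rank=8)
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))

Because the retained subspace aligns with the high-variance, outlier-heavy directions, most of the quantization error falls on the low-variance residual, which is the intuition behind ResQ's mixed-precision split.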
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper explores ways to make large language models more efficient. Right now, these models are very good at understanding human language, but they’re also slow and use a lot of computer power. The researchers developed a new method called ResQ that helps fix this by making the models run faster without losing their ability to understand language. They did this by finding a way to represent some of the information in the model with fewer “bits” (the 1s and 0s computers use) than usual, which makes the model smaller and faster.
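
As a rough illustration of why fewer bits help (not a figure from the paper): a model with 7 billion parameters stored at 16 bits each needs about 14 GB of memory, while the same model at 4 bits per parameter needs only about 3.5 GB, so far less data has to be moved around during inference.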

Keywords

» Artificial intelligence  » Inference  » PCA  » Perplexity  » Precision  » Principal component analysis  » Quantization