
Summary of ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals, by Utkarsh Saxena et al.


ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

by Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang

First submitted to arXiv on: 18 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
In this paper, the authors propose ResQ, a method for post-training quantization (PTQ) of large language models (LLMs) that reduces computational cost at inference time without compromising generalizability. The central PTQ challenge is to quantize all weight, activation, and key-value cache tensors to 4-bit while mitigating the high quantization error caused by extreme outliers in the activations. To address this, ResQ identifies a low-rank subspace using principal component analysis (PCA) and keeps coefficients within this subspace in high precision (8-bit), while quantizing the rest to 4-bit. An invariance-preserving random rotation is applied within each subspace to further suppress outliers. The authors demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on various benchmarks, achieving up to 33% lower perplexity on Wikitext than SpinQuant and a 3x speedup over the 16-bit baseline.
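
To make the projection-and-split idea above concrete, here is a minimal NumPy sketch (not the authors' implementation): it finds principal directions from a calibration activation matrix, keeps coefficients along the top few directions in 8-bit, and fake-quantizes the remaining coefficients to 4-bit. The names resq_style_quantize and fake_quantize, the toy rank of 8, and the simple per-tensor symmetric quantizer are illustrative assumptions, and the random rotation within each subspace is omitted for brevity.

# Minimal sketch of the ResQ idea described above (not the authors' code).
import numpy as np

def fake_quantize(x, bits):
    # Symmetric uniform fake quantization: quantize, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def resq_style_quantize(X, rank=8, high_bits=8, low_bits=4):
    # PCA basis via SVD of the mean-centered calibration activations.
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T                          # columns = principal directions
    coeffs = X @ V                    # coefficients in the rotated basis
    # Mixed precision: high-precision subspace, low-precision residual.
    coeffs[:, :rank] = fake_quantize(coeffs[:, :rank], high_bits)
    coeffs[:, rank:] = fake_quantize(coeffs[:, rank:], low_bits)
    return coeffs @ V.T               # map back to the original basis

# Toy usage: 4096 "tokens", hidden size 64, a few exaggerated outlier channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 64))
X[:, :4] *= 20.0
X_hat = resq_style_quantize(X, rank=8)
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))

Because the retained subspace aligns with the high-variance, outlier-heavy directions, most of the quantization error falls on the low-variance residual, which is the intuition behind ResQ's mixed-precision split.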
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper explores ways to make large language models more efficient. Right now, these models are very good at understanding human language, but they’re also slow and use a lot of computer power. The researchers developed a new method called ResQ that helps fix this by making the models run faster without losing their ability to understand language. They did this by finding a way to represent some of the information in the model with fewer “bits” (the 1s and 0s computers use) than usual, which makes the model smaller and faster.
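
As a rough illustration of why fewer bits help (not a figure from the paper): a model with 7 billion parameters stored at 16 bits each needs about 14 GB of memory, while the same model at 4 bits per parameter needs only about 3.5 GB, so far less data has to be moved around during inference.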

Keywords

» Artificial intelligence  » Inference  » PCA  » Perplexity  » Precision  » Principal component analysis  » Quantization