Summary of EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge, by Xuan Shen et al.
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
by Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam Leeser, Pu Zhao, Yanzhi Wang
First submitted to arXiv on: 16 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the paper's original abstract here |
Medium | GrooveSquid.com (original content) | This paper proposes EdgeQAT, a novel approach for optimizing lightweight Large Language Models (LLMs) for inference acceleration on edge devices. Although LLMs have achieved strong results in many fields, their heavy computation limits deployment on edge hardware, so quantization is commonly adopted to generate efficient models. However, current Post-Training Quantization (PTQ) methods and Quantization-Aware Training (QAT) works suffer from performance degradation when quantizing weights, activations, and the KV cache below 8 bits. EdgeQAT identifies the primary cause of this performance drop as information distortion in quantized attention maps, and mitigates it with entropy and distribution guided QAT. A token importance-aware adaptive method further quantizes tokens with different bit widths for additional optimization. The framework achieves substantial improvements across various datasets, resulting in an on-device speedup of up to 2.37x compared to its FP16 counterparts. |
Low | GrooveSquid.com (original content) | This paper is about making language models work better on devices like phones and tablets. Language models are really good at understanding what we say, but they need a lot of power to do it. To fix this, people have been trying to make the models smaller and more efficient. The problem is that when you make them too small, they don't work very well anymore. This paper proposes a new way to make language models smaller without losing their ability to understand us. The authors found that the main problem is that some parts of the model get distorted when it's made smaller, so they came up with a new approach that helps fix this distortion. It works really well and makes the models run much faster on our devices. |
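The medium-difficulty summary hinges on two technical ideas: simulating low-bit quantization during training ("fake" quantization, the core of QAT) and measuring the information content of attention maps via entropy. The sketch below is a rough, hypothetical illustration of those two ideas only, not the paper's actual EdgeQAT algorithm; the function names (`fake_quantize`, `attention_entropy`) and the 4-bit setting are assumptions made for this example.

```python
import numpy as np

def fake_quantize(x, num_bits=4):
    """Simulate symmetric low-bit quantization: quantize then dequantize,
    so the quantization error is visible during training (as in QAT)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_entropy(attn):
    """Shannon entropy of each row of an attention map (rows sum to 1).
    A large entropy shift after quantization signals information distortion."""
    eps = 1e-12
    return -np.sum(attn * np.log(attn + eps), axis=-1)

# Illustration: quantizing attention logits to 4 bits changes the
# attention distribution, and entropy quantifies that change.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 8))
attn_fp = softmax(logits)
attn_q = softmax(fake_quantize(logits, num_bits=4))
print("FP16-like entropy:", attention_entropy(attn_fp))
print("4-bit entropy:    ", attention_entropy(attn_q))
```

In this toy setting, comparing the two entropies is one plausible way to quantify the distortion the paper attributes to quantized attention maps; EdgeQAT's actual guidance terms are described in the paper itself.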
Keywords
* Artificial intelligence * Attention * Inference * Optimization * Quantization * Token