
Summary of Channel-Wise Mixed-Precision Quantization for Large Language Models, by Zihan Chen et al.


Channel-Wise Mixed-Precision Quantization for Large Language Models

by Zihan Chen, Bike Xie, Jundong Li, Cong Shen

First submitted to arXiv on: 16 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)

  • Abstract of paper
  • PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract; see the “Abstract of paper” link above.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large Language Models (LLMs) have achieved great success across language tasks, but deploying them on edge devices remains challenging because of their large parameter counts. Weight-only quantization is a promising way to reduce memory requirements, yet existing approaches focus mainly on integer bit-widths, which limits adaptability and leaves part of the storage budget unused. The paper introduces Channel-Wise Mixed-Precision Quantization (CMPQ), a method that allocates precision channel by channel based on activation distributions, so it can adapt to any bit-width constraint by assigning different precision levels to different weight channels. CMPQ also employs a non-uniform quantization strategy and two outlier extraction techniques to preserve critical information and minimize quantization loss. Experiments demonstrate significant performance gains with a modest increase in memory usage. (A rough code sketch of the channel-wise allocation idea appears after these summaries.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large Language Models are very good at lots of language tasks, but they need a lot of space to run on devices like phones or tablets. One way to make them use less space is “weight-only quantization”, which makes the model smaller without losing much of its ability to do things. The problem is that current methods can only shrink the model by whole numbers of bits, so they can’t always match the exact amount of space a device has, and some of that space goes unused. In this paper, the authors propose a new way to save space called Channel-Wise Mixed-Precision Quantization (CMPQ). CMPQ gives each part of the model just the right amount of space so it keeps working well.
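
To make the channel-wise idea concrete, here is a minimal Python sketch, under stated assumptions, of how activation-aware mixed-precision allocation could look: channels whose activations have larger magnitude get more bits, and every channel is then quantized at its own precision. The function names, the two-level (3-bit/4-bit) allocation, and the uniform per-channel quantizer are illustrative choices only, not the paper’s actual method, which uses a non-uniform quantizer plus outlier extraction that this sketch omits.

```python
# Illustrative sketch (not the authors' code): channel-wise mixed-precision
# weight quantization guided by activation statistics, under a target
# average bit-width. All names and parameters here are hypothetical.
import numpy as np

def allocate_bits(act_norms, target_bits, low=3, high=4):
    """Give `high` bits to the channels with the largest activation norms
    and `low` bits to the rest, so the average bit-width meets the target."""
    n = len(act_norms)
    # Fraction of channels that can receive the higher precision.
    frac_high = np.clip((target_bits - low) / (high - low), 0.0, 1.0)
    n_high = int(round(frac_high * n))
    order = np.argsort(act_norms)[::-1]  # most salient channels first
    bits = np.full(n, low, dtype=int)
    bits[order[:n_high]] = high
    return bits

def quantize_channel(w, bits):
    """Uniform symmetric per-channel quantization (a simple stand-in for
    the paper's non-uniform strategy)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def cmpq_like_quantize(W, act_norms, target_bits=3.5):
    """Quantize each row of W (treated as one channel here for simplicity)
    at its allocated precision."""
    bits = allocate_bits(act_norms, target_bits)
    return np.stack([quantize_channel(W[c], bits[c]) for c in range(W.shape[0])])

# Toy usage: an 8x16 weight matrix and made-up per-channel activation norms.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
act_norms = rng.uniform(size=8)
W_q = cmpq_like_quantize(W, act_norms, target_bits=3.5)
print(np.mean(np.abs(W - W_q)))  # average quantization error
```

The average target of 3.5 bits in the toy usage is just an example of a fractional bit-width budget that integer-only schemes cannot express but a channel-wise mixture of 3-bit and 4-bit channels can.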

Keywords

  • Artificial intelligence
  • Precision
  • Quantization