
Summary of “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization, by Eldar Kurtic et al.


“Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization

by Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

First submitted to arXiv on: 4 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
The paper conducts a comprehensive empirical study of different quantization formats (FP8, INT8, and INT4) across the entire Llama-3.1 model family, focusing on accelerating large language model (LLM) inference. The authors find that FP8 achieves lossless accuracy across all model scales, while well-tuned INT8 incurs only 1-3% accuracy degradation, and INT4 weight-only quantization is surprisingly competitive, rivaling 8-bit quantization. Using the vLLM framework, the study also benchmarks which quantization format performs best in different deployment scenarios and distills the results into practical guidelines for deploying quantized LLMs at scale.
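For readers who want to try the deployment side the summary mentions, a minimal sketch of serving a quantized model through vLLM might look like the following; the checkpoint name and sampling settings here are illustrative assumptions, not values taken from the paper:

# Minimal vLLM sketch: load a Llama-3.1 model with FP8 quantization and generate text.
# Model name and sampling settings are assumptions for illustration only.
from vllm import LLM, SamplingParams

# quantization="fp8" asks vLLM to run the model with FP8 weights/activations;
# pre-quantized INT8 or INT4 checkpoints can be loaded through the same interface.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain LLM quantization in one sentence."], params)

for out in outputs:
    print(out.outputs[0].text)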
Low Difficulty Summary (written by GrooveSquid.com; original content)
The paper looks at how to make large language models run faster without losing accuracy. It tries out three ways of compressing the model: FP8, INT8, and INT4. The results show that one method (FP8) keeps accuracy essentially unchanged for all model sizes, while another (INT8) loses only a small amount of accuracy (about 1-3%) when tuned well. The third method (INT4) is surprisingly good too! The paper also figures out which way of compressing the model works best in different situations and gives practical tips on how to use these findings.
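To make the idea of “compressing” a model concrete, here is a toy sketch of symmetric INT8 quantization: each weight is rounded to one of 256 integer levels and scaled back on use. This is a simplified illustration of the general technique, not the paper's well-tuned recipe:

# Toy symmetric INT8 round-trip: store weights as 8-bit integers plus one scale.
# Simplified illustration; real schemes use per-channel/per-group scales and more.
import numpy as np

def quantize_int8(w: np.ndarray):
    # Map the largest-magnitude weight to +/-127 (assumes w is not all zeros).
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max round-trip error:", np.abs(w - w_hat).max())

The gap between w and w_hat is the “accuracy loss” the paper measures: fewer bits mean a smaller, faster model but a coarser approximation of the original weights.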

Keywords

» Artificial intelligence  » Inference  » Large language model  » Llama  » Quantization