Summary of Mixture Compressor for Mixture-of-Experts LLMs Gains More, by Wei Huang et al.
Mixture Compressor for Mixture-of-Experts LLMs Gains More
by Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi
First submitted to arxiv on: 8 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | A significant advancement in large language models (LLMs) is the Mixture-of-Experts (MoE) architecture, which combines multiple experts to tackle various tasks. However, this approach faces two major hurdles: high memory consumption and slow loading caused by the expert parameters, and redundancy among activated experts, since many tokens may only need a single expert. To overcome these issues, the researchers investigate MoE-LLMs and find that experts behave very differently in terms of activation reconstruction error, routing scores, and activation frequency. They also find that not all tokens are equally important, with only a small subset being critical. Building on these insights, the authors propose MC (Mixture-Compressor), a training-free approach that leverages expert and token significance to achieve extreme compression. MC combines Pre-Loading Mixed-Precision Quantization, which optimizes adaptive bit-width allocation via Linear Programming, with Online Dynamic Pruning, which identifies important tokens and dynamically selects the activated experts during inference (illustrative sketches of both components follow this table). The authors demonstrate MC's effectiveness by compressing MoE-LLMs with minimal accuracy loss, achieving a 76.6% compression rate at an average of 2.54 bits while maintaining performance. |
Low | GrooveSquid.com (original content) | Large language models are getting smarter, but they need help to run efficiently on our computers! Researchers have found that these models use too many experts and store too much information, making them slow and memory-hungry. They discovered that not all parts of the model are equally important, so they developed a new way to compress the model without sacrificing its abilities. This new method, called MC, works like a filter that reduces the amount of information stored while keeping the most critical parts intact. Using this approach, the researchers were able to shrink the model by about 76% without losing much accuracy! |
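To make the pre-loading quantization component more concrete, below is a minimal sketch (not the authors' code) of how adaptive bit-width allocation could be posed as a linear program: each expert picks one candidate bit-width, the objective weights each expert's estimated quantization error by an importance score, and an average-bit budget caps total storage. The importance scores, error estimates, candidate bit-widths, and the final rounding step are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of LP-based mixed-precision bit-width allocation for MoE experts.
# All inputs here are hypothetical placeholders for illustration only.
import numpy as np
from scipy.optimize import linprog

def allocate_bitwidths(importance, quant_error, candidate_bits, avg_bit_budget):
    """LP relaxation of per-expert bit-width assignment.

    importance:     (E,) importance score per expert (e.g. from routing frequency).
    quant_error:    (E, B) estimated quantization error of expert e at bit-width b.
    candidate_bits: (B,) candidate bit-widths, e.g. [1, 2, 3].
    avg_bit_budget: target average bits per expert, e.g. 2.54.
    """
    E, B = quant_error.shape
    # Decision variables x[e, b] in [0, 1]: weight of storing expert e at bits b.
    c = (importance[:, None] * quant_error).ravel()   # minimize importance-weighted error
    # Each expert must pick exactly one bit-width: sum_b x[e, b] = 1.
    A_eq = np.zeros((E, E * B))
    for e in range(E):
        A_eq[e, e * B:(e + 1) * B] = 1.0
    b_eq = np.ones(E)
    # Average bit-width budget: sum_{e,b} bits[b] * x[e, b] <= budget * E.
    A_ub = np.tile(candidate_bits, E)[None, :]
    b_ub = np.array([avg_bit_budget * E])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (E * B), method="highs")
    x = res.x.reshape(E, B)
    # Heuristically round the relaxed solution to one bit-width per expert.
    return candidate_bits[np.argmax(x, axis=1)]

# Example: 8 experts, candidate bit-widths {1, 2, 3}, ~2.5 bits on average.
rng = np.random.default_rng(0)
importance = rng.random(8)
candidate_bits = np.array([1, 2, 3])
quant_error = 1.0 / candidate_bits[None, :] * (1 + rng.random((8, 3)))
print(allocate_bitwidths(importance, quant_error, candidate_bits, 2.5))
```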
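Likewise, here is a rough sketch of the idea behind online dynamic expert pruning, assuming a standard top-k softmax router: experts beyond the first are dropped once the routing probability mass already kept for a token passes a threshold, so confidently routed tokens activate fewer experts. The threshold-based criterion is an assumption chosen for illustration, not the paper's exact token/expert scoring.

```python
# Sketch of dynamic expert selection at inference time (illustrative only).
import torch

def dynamic_expert_selection(router_logits: torch.Tensor, top_k: int = 2,
                             mass_threshold: float = 0.8):
    """router_logits: (num_tokens, num_experts) raw gating scores."""
    probs = torch.softmax(router_logits, dim=-1)
    top_p, top_idx = probs.topk(top_k, dim=-1)            # usual top-k routing
    cum = top_p.cumsum(dim=-1)
    # Always keep the first expert; drop later experts once the probability mass
    # already kept exceeds the threshold (the token is confidently routed).
    keep = torch.ones_like(top_p, dtype=torch.bool)
    keep[:, 1:] = cum[:, :-1] < mass_threshold
    weights = torch.where(keep, top_p, torch.zeros_like(top_p))
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
    return top_idx, weights, keep

# Example with 4 tokens and 8 experts.
logits = torch.randn(4, 8)
idx, w, keep = dynamic_expert_selection(logits)
print(idx, w, keep, sep="\n")
```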
Keywords
» Artificial intelligence » Inference » Mixture of experts » Precision » Pruning » Quantization » Token