Summary of Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training, by Xianzhi Du et al.
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training
by Xianzhi Du, Tom Gunter, Xiang Kong, Mark Lee, Zirui Wang, Aonan Zhang, Nan Du, Ruoming Pang
First submitted to arXiv on: 23 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | This paper revisits how Mixture-of-Experts (MoE) LLMs compare to dense LLMs. The authors argue that earlier comparisons favor MoE by using FLOPs or activated parameter counts as the measure of model complexity, which underestimates the communication overhead of sparse MoE layers. For a fairer comparison, they instead adopt step time as the complexity measure and determine the total compute budget under Chinchilla settings. To run MoE efficiently on modern accelerators, they use a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range. They evaluate MoE and dense LLMs on English tasks in 0-shot and 1-shot settings, plus MMLU 5-shot and GSM8K 8-shot, across three model scales. MoE consistently outperforms dense LLMs on the speed-accuracy trade-off curve by meaningful margins. The full model and sharding implementation is available at https://github.com/apple/axlearn. (Two illustrative sketches of the compute-budget arithmetic and the sharding layout follow the table.) |
| Low | GrooveSquid.com (original content) | This research compares two kinds of language models: Mixture-of-Experts (MoE) models and dense models. MoE models can be faster, but earlier studies did not fully account for how much their parts need to communicate with each other. The authors make the comparison fairer by measuring actual step time, which captures that communication cost, and they describe a way to run MoE efficiently on modern accelerators. Their results show that MoE models reach better accuracy than dense models at the same training speed across all the model sizes they test. |
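
The summary mentions determining the total compute budget under Chinchilla settings while comparing models by step time rather than FLOPs. Below is a back-of-the-envelope sketch of that idea: the 20-tokens-per-parameter rule and the C ≈ 6·N·D FLOPs estimate are the standard Chinchilla heuristics, while the model size, the 15% slowdown, and the step-time matching are illustrative assumptions, not numbers reported by this paper.

```python
# Illustrative Chinchilla-style budget and a step-time-based comparison.
# All concrete numbers below are hypothetical.

def chinchilla_budget(n_params: float) -> tuple[float, float]:
    """Return (training_tokens, training_flops) for a compute-optimal run."""
    tokens = 20.0 * n_params          # ~20 tokens per parameter (Chinchilla rule of thumb)
    flops = 6.0 * n_params * tokens   # forward + backward ~ 6 FLOPs per param per token
    return tokens, flops

n_dense = 6.4e9                        # hypothetical dense model size
tokens, flops = chinchilla_budget(n_dense)

# If swapping dense FFN layers for MoE makes each step, say, 15% slower
# (communication in the sparse layers), then matching *wall-clock* budget
# means the MoE model sees proportionally fewer tokens than a FLOPs-matched
# comparison would suggest.
step_time_ratio = 1.15                 # hypothetical dense -> MoE slowdown
moe_tokens_at_equal_time = tokens / step_time_ratio

print(f"dense tokens: {tokens:.3e}, FLOPs: {flops:.3e}")
print(f"MoE tokens at equal wall-clock budget: {moe_tokens_at_equal_time:.3e}")
```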
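
The paper also describes a 3D sharding method for running MoE efficiently on modern accelerators; the authors' actual implementation is in the linked apple/axlearn repository. The JAX snippet below is only a minimal sketch of the general idea of a data × expert × model device mesh: the axis names, tensor shapes, and the 8-device mesh are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of a 3D device mesh and of sharding an MoE feed-forward
# weight across it. Assumes 8 accelerators; shapes and axis names are illustrative.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange devices as 2 (data) x 2 (expert) x 2 (model).
devices = mesh_utils.create_device_mesh((2, 2, 2))
mesh = Mesh(devices, axis_names=("data", "expert", "model"))

# MoE feed-forward weights: one (d_model, d_ff) matrix per expert.
num_experts, d_model, d_ff = 8, 1024, 4096
w = jnp.zeros((num_experts, d_model, d_ff))

# Split experts over the "expert" axis and the hidden dimension over the
# "model" axis; batch/token dimensions of activations would be split over
# "data". In expert-parallel setups, the token dispatch all-to-all runs
# along the expert axis.
w_sharded = jax.device_put(w, NamedSharding(mesh, P("expert", None, "model")))
print(w_sharded.sharding)
```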
Keywords
- Artificial intelligence
- 1-shot
- Mixture of experts