Summary of Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models, by Bowen Pan et al.
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
by Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda
First submitted to arXiv on: 8 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Mixture-of-Experts (MoE) language models can cut computational cost by 2-4 times compared to dense models without sacrificing performance, which makes them more efficient in computation-bounded scenarios. However, to match a dense model’s performance, MoE models generally need 2-4 times more parameters. The larger parameter count raises GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios such as autoregressive generation. The paper proposes a hybrid dense training and sparse inference framework for MoE models (DS-MoE) that achieves strong computation and parameter efficiency by computing densely across all experts during training and sparsely during inference. Experiments on training large language models show that DS-MoE models are more parameter-efficient than standard sparse MoEs, and are comparable to dense models in total parameter size and performance while being computationally cheaper (activating 30-40% of the model’s parameters). Performance tests using vLLM show that DS-MoE-6B runs up to 1.86 times faster than similar dense models such as Mistral-7B, and between 1.50 and 1.71 times faster than comparable MoEs such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary MoE language models can save time and energy by using less computing power. However, they need more parameters, and so more memory, to work as well as regular (dense) models. That slows them down when speed is limited by moving data through memory, as happens when generating text one word at a time. To fix this problem, the paper proposes a new way of training MoE models. This approach uses all of the model’s parts during training but only some of them during actual use. The results show that models trained this way need fewer parameters than traditional MoE models and run faster than comparable models. |
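
The dense-training, sparse-inference idea described in the summaries can be illustrated with a toy layer. The sketch below is not the authors' implementation: the names `matvec`, `softmax`, `moe_forward`, and `top_k`, the single-matrix "experts", and the choice not to renormalize gate weights after selection are all assumptions made for illustration. It only shows the general pattern of evaluating every expert in one mode and the highest-scoring few in the other.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router, experts, top_k=None):
    """Toy MoE layer: weight each expert's output by its router probability.

    top_k=None -> dense mode, every expert is evaluated (training-style).
    top_k=k    -> sparse mode, only the k highest-scoring experts are
                  evaluated (inference-style).
    Assumption: gate weights are NOT renormalized after top-k selection.
    Returns (output_vector, number_of_experts_evaluated).
    """
    probs = softmax(matvec(router, x))
    order = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    active = order if top_k is None else order[:top_k]
    out = [0.0] * len(experts[0])
    for i in active:
        expert_out = matvec(experts[i], x)
        out = [o + probs[i] * y for o, y in zip(out, expert_out)]
    return out, len(active)


# Hypothetical sizes and random weights, chosen only for illustration.
rng = random.Random(0)
dim, num_experts = 4, 8
experts = [[[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
           for _ in range(num_experts)]
router = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

dense_out, dense_used = moe_forward(x, router, experts)             # all experts
sparse_out, sparse_used = moe_forward(x, router, experts, top_k=2)  # 2 experts
print(f"dense pass evaluated {dense_used} experts, sparse pass {sparse_used}")
```

In this toy setting the sparse pass runs only a quarter of the expert multiplications, while all expert weights still have to be stored. That is the trade-off the summaries describe: less computation per token, but the same total parameter count in memory.
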
Keywords
* Artificial intelligence
* Autoregressive
* Inference
* Mixture of experts
* Parameter efficient