Summary of GRIN: GRadient-INformed MoE, by Liyuan Liu et al.
GRIN: GRadient-INformed MoE
by Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
First submitted to arXiv on: 18 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | The paper’s original abstract. Read it on arXiv.
Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to scaling up mixture-of-experts (MoE) models, which rely on sparse computation: expert routing selectively activates only a small subset of expert modules for each input. However, the discrete nature of expert routing challenges traditional training practices because it hinders standard backpropagation and gradient-based optimization. To address this, the authors introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing (see the illustrative sketch after this table) and configures model parallelism to avoid token dropping. Applied to autoregressive language modeling, GRIN produces a top-2 16×3.8B MoE model that outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
Low | GrooveSquid.com (original content) | MoE models are a type of deep learning architecture that can be scaled up to process large amounts of data. However, traditional training methods don’t work well with MoE models because they use expert routing, which makes it hard for the model to learn from its mistakes. To fix this problem, the researchers developed a new approach called GRIN (GRadient-INformed MoE training). This method helps the MoE model learn more effectively by using sparse gradients and parallel processing. The authors tested their approach on language modeling tasks and found that it worked very well.
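The medium-difficulty summary notes that GRIN's key idea is sparse gradient estimation for the discrete expert-routing step. As a rough, minimal sketch of that general idea, the PyTorch snippet below implements a top-2 MoE layer that uses a straight-through-style estimator so gradients can still reach the router. This is an illustration under stated assumptions, not the paper's exact estimator or code; the class and parameter names (Top2MoE, d_model, d_hidden, n_experts) are invented for the example.

```python
# Minimal sketch (assumption: PyTorch). A top-2 MoE layer where the discrete
# routing step uses a straight-through-style estimator; this stands in for the
# general idea of "sparse gradient estimation for expert routing" and is NOT
# the exact estimator from the GRIN paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts)
        _, top_i = probs.topk(2, dim=-1)                 # discrete top-2 routing
        hard = torch.zeros_like(probs).scatter(-1, top_i, 1.0)
        # Straight-through trick: the forward pass uses the hard 0/1 routing
        # mask, while the backward pass sends gradients through the soft
        # probabilities, so the router gets a learning signal despite the
        # non-differentiable top-k selection.
        gates = hard + probs - probs.detach()
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx = (hard[:, e] > 0).nonzero(as_tuple=True)[0]
            if token_idx.numel() == 0:
                continue
            expert_out = expert(x[token_idx]) * gates[token_idx, e].unsqueeze(-1)
            out = out.index_add(0, token_idx, expert_out)
        return out


# Example usage on random token embeddings (shapes are arbitrary):
layer = Top2MoE(d_model=64, d_hidden=128, n_experts=16)
y = layer(torch.randn(10, 64))
```

The straight-through trick shown here is just one common workaround for non-differentiable routing; per the abstract, the paper's contribution is a more principled sparse gradient estimator for expert routing together with a model-parallelism configuration that avoids dropping tokens.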
Keywords
» Artificial intelligence » Autoregressive » Backpropagation » Deep learning » Mixture of experts » Optimization » Token