
Summary of Scaling Laws for Fine-Grained Mixture of Experts, by Jakub Krajewski et al.


Scaling Laws for Fine-Grained Mixture of Experts

by Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

First submitted to arXiv on: 12 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Mixture of Experts (MoE) models have become a popular solution for reducing the computational cost of Large Language Models. This paper analyzes their scaling properties by introducing a new hyperparameter, granularity, which allows precise control over the size of the experts. The authors establish scaling laws for fine-grained MoE models that account for the number of training tokens, the model size, and the granularity. Leveraging these laws, they derive compute-optimal training configurations for a given computational budget. The findings show that MoE models consistently outperform dense Transformers, with the efficiency gap widening as model size and training budget grow. The authors also demonstrate that the common practice of setting the expert size to mirror the feed-forward layer is not optimal at most computational budgets.

Low Difficulty Summary (written by GrooveSquid.com, original content)
MoE models make big language models cheaper to train! This paper looks at how these models behave when they get really big. The authors add a new knob, called granularity, which lets them control how big each “expert” (a kind of mini-model inside the main model) is. Using this tool, the researchers found rules that predict how well MoE models will do depending on things like how much training data they see and how big the model is. This helps figure out the best way to train these models without wasting computer power. The results show that MoE models beat regular dense models, and the advantage grows as the models get bigger. But, surprisingly, making each expert the same size as a standard model layer usually isn’t the best choice.
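
The medium difficulty summary mentions scaling laws that relate model quality to model size, the number of training tokens, and granularity, and says the authors use those laws to pick compute-optimal training setups. The short Python sketch below illustrates that idea only: the functional form (a Chinchilla-style loss with an extra granularity-dependent term), every coefficient value, and the FLOPs ≈ 6 × parameters × tokens approximation are illustrative assumptions, not the fitted law from the paper.

# Illustrative sketch only: the loss form and every constant below are
# hypothetical placeholders, not values fitted in the paper.

def moe_loss(n_params, n_tokens, granularity,
             a=2.0, alpha=0.34, b=2.5, beta=0.28, g=0.5, gamma=0.6, c=1.7):
    """Chinchilla-style loss estimate with an extra granularity-dependent term."""
    expert_term = g / (granularity ** gamma) + a
    return c + expert_term / (n_params ** alpha) + b / (n_tokens ** beta)

def best_allocation(flops_budget, granularity, flops_per_param_token=6.0):
    """Grid-search the parameters/tokens split that minimizes the predicted
    loss under a fixed training budget, using FLOPs ~= 6 * params * tokens."""
    best = None
    for tenth_exp in range(70, 121):           # parameter counts from 1e7 to 1e12
        n_params = 10 ** (tenth_exp / 10)
        n_tokens = flops_budget / (flops_per_param_token * n_params)
        loss = moe_loss(n_params, n_tokens, granularity)
        if best is None or loss < best[0]:
            best = (loss, n_params, n_tokens)
    return best

if __name__ == "__main__":
    for gran in (1, 4, 16):                    # granularity 1 = experts sized like a dense FFN
        loss, n, d = best_allocation(1e21, granularity=gran)
        print(f"G={gran:>2}: loss ~{loss:.3f}, ~{n:.2e} params, ~{d:.2e} tokens")

Running the sketch simply shows how a fixed compute budget would be split between parameters and training tokens under different assumed granularities; the actual optimal configurations come from the laws fitted in the paper itself.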

Keywords

* Artificial intelligence
* Hyperparameter
* Scaling laws