Summary of Resolving Discrepancies in Compute-Optimal Scaling of Language Models, by Tomer Porian et al.
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
by Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
First submitted to arXiv on: 27 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper investigates the discrepancy between two influential compute-optimal scaling laws for language models: Kaplan et al.’s and Hoffmann et al.’s (Chinchilla). The authors trace the difference to three factors: the computational cost of the last layer, the warmup duration, and scale-dependent optimizer tuning. After correcting these factors, their experiments agree closely with the Hoffmann et al. scaling law. They also derive scaling laws for the optimal learning rate and batch size, finding that tuning AdamW’s beta-2 parameter is essential at lower batch sizes. (A rough illustration of fitting such a power law follows this table.)
Low | GrooveSquid.com (original content) | This paper helps us understand why two important machine learning rules disagree. Both rules answer the same question: if you have a fixed amount of computer power, how big should your model be and how much data should it see? Kaplan et al. and Hoffmann et al. give different answers. The researchers test both recipes on two big datasets and track down what makes the answers different. They show that fixing a few details in how the experiments were run makes the two rules agree. This helps us understand why some ways of training language models work better than others.
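As a rough, hypothetical illustration of what deriving a scaling law for an optimal hyperparameter involves, the sketch below fits a power law lr_opt ≈ a · N^b to made-up (model size, best learning rate) pairs by least squares in log-log space. This is not the authors’ code or procedure, and the data points are invented.

```python
# Hypothetical sketch of fitting a power-law scaling law; the data below is
# invented and does not come from the paper.
import numpy as np

# Made-up sweep results: model sizes N (parameters) and the learning rate
# that performed best at each size.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
best_lrs = np.array([1.0e-3, 6.3e-4, 4.0e-4, 2.5e-4, 1.6e-4])

# A power law lr_opt = a * N^b is linear in log space:
#   log(lr_opt) = log(a) + b * log(N)
slope, intercept = np.polyfit(np.log(model_sizes), np.log(best_lrs), deg=1)
a, b = np.exp(intercept), slope
print(f"fitted law: lr_opt ~= {a:.3g} * N^{b:.3f}")

# Extrapolate the fitted law to a larger (hypothetical) model.
print(f"predicted optimal lr at N=1e10: {a * 1e10 ** b:.2e}")
```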
Keywords
» Artificial intelligence » Machine learning » Scaling laws