Summary of Resolving Discrepancies in Compute-Optimal Scaling of Language Models, by Tomer Porian et al.
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
by Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
First submitted to arXiv on: 27 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper investigates the discrepancy between two influential compute-optimal scaling laws for language models: Kaplan et al.’s and Hoffmann et al.’s (Chinchilla). The authors trace the difference to three factors: the computational cost of the last layer, the warmup duration, and scale-dependent optimizer tuning. After correcting these factors, their experiments agree closely with the Hoffmann et al. scaling law. They also derive scaling laws for the optimal learning rate and batch size, finding that tuning AdamW’s beta-2 parameter is essential at lower batch sizes. (A rough illustration of fitting such a power law follows this table.)
Low | GrooveSquid.com (original content) | This paper helps us understand why two important machine learning rules disagree. Both rules answer the same question: if you have a fixed amount of computer power, how big should your model be and how much data should it see? Kaplan et al. and Hoffmann et al. give different answers. The researchers test both recipes on two big datasets and track down what makes the answers different. They show that fixing a few details in how the experiments were run makes the two rules agree. This helps us understand why some ways of training language models work better than others.
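As a rough, hypothetical illustration of what deriving a scaling law for an optimal hyperparameter involves, the sketch below fits a power law lr_opt ≈ a · N^b to made-up (model size, best learning rate) pairs by least squares in log-log space. This is not the authors’ code or procedure, and the data points are invented.

```python
# Hypothetical sketch of fitting a power-law scaling law; the data below is
# invented and does not come from the paper.
import numpy as np

# Made-up sweep results: model sizes N (parameters) and the learning rate
# that performed best at each size.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
best_lrs = np.array([1.0e-3, 6.3e-4, 4.0e-4, 2.5e-4, 1.6e-4])

# A power law lr_opt = a * N^b is linear in log space:
#   log(lr_opt) = log(a) + b * log(N)
slope, intercept = np.polyfit(np.log(model_sizes), np.log(best_lrs), deg=1)
a, b = np.exp(intercept), slope
print(f"fitted law: lr_opt ~= {a:.3g} * N^{b:.3f}")

# Extrapolate the fitted law to a larger (hypothetical) model.
print(f"predicted optimal lr at N=1e10: {a * 1e10 ** b:.2e}")
```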
Keywords
» Artificial intelligence » Machine learning » Scaling laws