


How Does Critical Batch Size Scale in Pre-training?

by Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

First submitted to arXiv on: 29 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, written at three levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract; read it on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper examines critical batch size (CBS), an efficiency notion for training large-scale models under a given resource budget: CBS marks the threshold beyond which greater data parallelism yields diminishing returns. The authors propose a measure of CBS and pre-train a series of auto-regressive language models on the C4 dataset, investigating the impact of scale on CBS through extensive hyper-parameter sweeps. They find that CBS scales primarily with data size rather than model size, a finding they justify theoretically through analyses of infinite-width limits of neural networks and of infinite-dimensional least squares regression. The paper also highlights the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.
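
To make the CBS idea concrete, here is a minimal Python sketch of one common way to operationalize it (following the empirical steps-vs-batch-size tradeoff of McCandlish et al., 2018, not necessarily the exact measure used in this paper): fit the number of optimizer steps needed to reach a fixed target loss as a function of batch size, S(B) ≈ S_min * (1 + B_crit / B), and read off B_crit as the critical batch size. The sweep data below are made up purely for illustration.

```python
# Hypothetical sketch: estimating a critical batch size from a batch-size sweep.
# Assumes the classic tradeoff S(B) ~= S_min * (1 + B_crit / B), where S(B) is
# the number of optimizer steps needed to reach a fixed target loss at batch
# size B. This illustrates the general notion of CBS, not necessarily the
# paper's exact measurement procedure.
import numpy as np

# Toy sweep data: (batch size, steps to reach the target loss).
# In practice these would come from hyper-parameter sweeps like the paper's.
batch_sizes = np.array([32, 64, 128, 256, 512, 1024], dtype=float)
steps_to_target = np.array([21000, 11000, 6200, 3800, 2700, 2200], dtype=float)

# S(B) = S_min + S_min * B_crit * (1 / B) is linear in 1/B:
#   y = a + b * x  with  x = 1/B, a = S_min, b = S_min * B_crit.
x = 1.0 / batch_sizes
A = np.vstack([np.ones_like(x), x]).T
(a, b), *_ = np.linalg.lstsq(A, steps_to_target, rcond=None)

s_min = a        # steps needed in the infinite-batch-size limit
b_crit = b / a   # critical batch size: where returns start to diminish
print(f"S_min ~ {s_min:.0f} steps, critical batch size ~ {b_crit:.0f}")
```

Intuitively, at batch sizes well below B_crit, doubling the batch roughly halves the steps needed; well above B_crit, extra data parallelism buys little, which is exactly the diminishing-returns threshold CBS is meant to capture.
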
Low Difficulty Summary (original content by GrooveSquid.com)
Training big models efficiently is important. Researchers know that processing more training data in parallel (using bigger batches) can speed things up, but only up to a point: beyond it, bigger batches stop helping much, and it is hard to know in advance where that point lies. To figure this out, the scientists propose a way to measure this threshold, called the critical batch size, and train many language models with different numbers of parameters. They test how these models do under different conditions and find that the threshold depends mainly on how much data you train on, not on how big your model is. This helps us understand how to train large models efficiently.

Keywords

* Artificial intelligence
* Regression