


How Does Critical Batch Size Scale in Pre-training?

by Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

First submitted to arXiv on: 29 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, written at three levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract; read it on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper examines critical batch size (CBS), an efficiency notion for training large-scale models under a given resource budget: CBS marks the threshold beyond which greater data parallelism yields diminishing returns. The authors propose a measure of CBS and pre-train a series of auto-regressive language models on the C4 dataset, investigating the impact of scale on CBS through extensive hyper-parameter sweeps. They find that CBS scales primarily with data size rather than model size, a finding they justify theoretically through analyses of infinite-width limits of neural networks and of infinite-dimensional least squares regression. The paper also highlights the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.
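
To make the CBS idea concrete, here is a minimal Python sketch of one common way to operationalize it (following the empirical steps-vs-batch-size tradeoff of McCandlish et al., 2018, not necessarily the exact measure used in this paper): fit the number of optimizer steps needed to reach a fixed target loss as a function of batch size, S(B) ≈ S_min * (1 + B_crit / B), and read off B_crit as the critical batch size. The sweep data below are made up purely for illustration.

```python
# Hypothetical sketch: estimating a critical batch size from a batch-size sweep.
# Assumes the classic tradeoff S(B) ~= S_min * (1 + B_crit / B), where S(B) is
# the number of optimizer steps needed to reach a fixed target loss at batch
# size B. This illustrates the general notion of CBS, not necessarily the
# paper's exact measurement procedure.
import numpy as np

# Toy sweep data: (batch size, steps to reach the target loss).
# In practice these would come from hyper-parameter sweeps like the paper's.
batch_sizes = np.array([32, 64, 128, 256, 512, 1024], dtype=float)
steps_to_target = np.array([21000, 11000, 6200, 3800, 2700, 2200], dtype=float)

# S(B) = S_min + S_min * B_crit * (1 / B) is linear in 1/B:
#   y = a + b * x  with  x = 1/B, a = S_min, b = S_min * B_crit.
x = 1.0 / batch_sizes
A = np.vstack([np.ones_like(x), x]).T
(a, b), *_ = np.linalg.lstsq(A, steps_to_target, rcond=None)

s_min = a        # steps needed in the infinite-batch-size limit
b_crit = b / a   # critical batch size: where returns start to diminish
print(f"S_min ~ {s_min:.0f} steps, critical batch size ~ {b_crit:.0f}")
```

Intuitively, at batch sizes well below B_crit, doubling the batch roughly halves the steps needed; well above B_crit, extra data parallelism buys little, which is exactly the diminishing-returns threshold CBS is meant to capture.
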
Low Difficulty Summary (original content by GrooveSquid.com)
Training big models efficiently is important. Researchers know that processing more training data in parallel (using bigger batches) can speed things up, but only up to a point: beyond it, bigger batches stop helping much, and it is hard to know in advance where that point lies. To figure this out, the scientists propose a way to measure this threshold, called the critical batch size, and train many language models with different numbers of parameters. They test how these models do under different conditions and find that the threshold depends mainly on how much data you train on, not on how big your model is. This helps us understand how to train large models efficiently.

Keywords

* Artificial intelligence
* Regression