Summary of Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism, by Tim Tsz-Kit Lau et al.
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
by Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
First submitted to arXiv on: 30 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Optimization and Control (math.OC); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below cover the same AI paper at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper addresses the dilemma of choosing batch sizes in large-scale language model training: large-batch training improves training efficiency, but generalization performance often deteriorates because large batches reduce gradient noise. The authors argue that current practices prioritize training efficiency over generalization and propose theoretically principled adaptive batch size schedules that are compatible with both data parallelism and model parallelism. The schedules are implemented with PyTorch Fully Sharded Data Parallel (FSDP) and empirically shown to outperform constant batch sizes and heuristic warmup schedules in the pretraining of Llama 2 family models, particularly smaller ones with up to 3 billion parameters. Theoretical convergence guarantees are established for these adaptive batch size schedules with Adam for general smooth nonconvex objectives. A hedged code sketch of one possible schedule follows the table. |
| Low | GrooveSquid.com (original content) | The paper is about finding a good way to choose batch sizes when training large language models. Right now, people often pick batch sizes that make training run faster rather than ones that give better results. The authors think that is a bad trade-off and come up with a new method that adjusts the batch size as training goes on. They show that their approach works better in practice than the usual fixed choices. This is important because it could help people train even bigger and more powerful language models. |
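As a rough illustration of the idea described in the medium-difficulty summary, the sketch below shows a hypothetical increasing batch size schedule and how it could be translated into gradient-accumulation steps under data parallelism (e.g., PyTorch FSDP). The function names, thresholds, and default values are illustrative assumptions, not the authors' actual algorithm, which adapts the batch size according to theoretically principled criteria rather than a fixed doubling rule.

```python
# Hypothetical sketch of an increasing batch size schedule (not the paper's method).
# The idea: start with a small global batch size and grow it during training,
# realized in data-parallel training (e.g., PyTorch FSDP) by increasing the
# number of gradient-accumulation steps per optimizer update.

def effective_batch_size(step: int,
                         base_batch_size: int = 256,
                         max_batch_size: int = 4096,
                         double_every: int = 2000) -> int:
    """Double the global batch size every `double_every` steps, capped at a maximum.

    All parameter names and default values are illustrative assumptions.
    """
    doublings = step // double_every
    return min(base_batch_size * (2 ** doublings), max_batch_size)


def grad_accum_steps(step: int, per_device_batch: int, world_size: int) -> int:
    """Convert the target global batch size into gradient-accumulation steps,
    given a fixed per-device micro-batch size and the number of data-parallel ranks."""
    target = effective_batch_size(step)
    return max(1, target // (per_device_batch * world_size))


if __name__ == "__main__":
    # Example: 8 data-parallel ranks, micro-batch of 4 sequences per rank.
    for step in (0, 1999, 2000, 4000, 8000):
        print(step,
              effective_batch_size(step),
              grad_accum_steps(step, per_device_batch=4, world_size=8))
```

In an actual training loop, each optimizer update would accumulate gradients over `grad_accum_steps` micro-batches before calling the optimizer step; the paper's schedules instead adjust the batch size adaptively and come with convergence guarantees when used with Adam.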
Keywords
» Artificial intelligence » Generalization » Llama » Pretraining