Summary of Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism, by Tim Tsz-Kit Lau et al.
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
by Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
First submitted to arXiv on: 30 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Optimization and Control (math.OC); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below cover the same AI paper at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper addresses the dilemma of choosing batch sizes in large-scale language model training: large-batch training improves training efficiency, but generalization performance often deteriorates because large batches reduce gradient noise. The authors argue that current practices prioritize training efficiency over generalization and propose theoretically principled adaptive batch size schedules that are compatible with both data parallelism and model parallelism. The schedules are implemented with PyTorch Fully Sharded Data Parallel (FSDP) and empirically shown to outperform constant batch sizes and heuristic warmup schedules in the pretraining of Llama 2 family models, particularly smaller ones with up to 3 billion parameters. Theoretical convergence guarantees are established for these adaptive batch size schedules with Adam for general smooth nonconvex objectives. A hedged code sketch of one possible schedule follows the table. |
| Low | GrooveSquid.com (original content) | The paper is about finding a good way to choose batch sizes when training large language models. Right now, people often pick batch sizes that make training run faster rather than ones that give better results. The authors think that is a bad trade-off and come up with a new method that adjusts the batch size as training goes on. They show that their approach works better in practice than the usual fixed choices. This is important because it could help people train even bigger and more powerful language models. |
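As a rough illustration of the idea described in the medium-difficulty summary, the sketch below shows a hypothetical increasing batch size schedule and how it could be translated into gradient-accumulation steps under data parallelism (e.g., PyTorch FSDP). The function names, thresholds, and default values are illustrative assumptions, not the authors' actual algorithm, which adapts the batch size according to theoretically principled criteria rather than a fixed doubling rule.

```python
# Hypothetical sketch of an increasing batch size schedule (not the paper's method).
# The idea: start with a small global batch size and grow it during training,
# realized in data-parallel training (e.g., PyTorch FSDP) by increasing the
# number of gradient-accumulation steps per optimizer update.

def effective_batch_size(step: int,
                         base_batch_size: int = 256,
                         max_batch_size: int = 4096,
                         double_every: int = 2000) -> int:
    """Double the global batch size every `double_every` steps, capped at a maximum.

    All parameter names and default values are illustrative assumptions.
    """
    doublings = step // double_every
    return min(base_batch_size * (2 ** doublings), max_batch_size)


def grad_accum_steps(step: int, per_device_batch: int, world_size: int) -> int:
    """Convert the target global batch size into gradient-accumulation steps,
    given a fixed per-device micro-batch size and the number of data-parallel ranks."""
    target = effective_batch_size(step)
    return max(1, target // (per_device_batch * world_size))


if __name__ == "__main__":
    # Example: 8 data-parallel ranks, micro-batch of 4 sequences per rank.
    for step in (0, 1999, 2000, 4000, 8000):
        print(step,
              effective_batch_size(step),
              grad_accum_steps(step, per_device_batch=4, world_size=8))
```

In an actual training loop, each optimizer update would accumulate gradients over `grad_accum_steps` micro-batches before calling the optimizer step; the paper's schedules instead adjust the batch size adaptively and come with convergence guarantees when used with Adam.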
Keywords
» Artificial intelligence » Generalization » Llama » Pretraining