
Summary of Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism, by Tim Tsz-Kit Lau et al.


Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism

by Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar

First submitted to arXiv on: 30 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Optimization and Control (math.OC); Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (GrooveSquid.com, original content)
The proposed work addresses the dilemma of choosing batch sizes in large-scale training of language models: large-batch training improves hardware efficiency, but generalization performance often deteriorates because of reduced gradient noise. The authors criticize current practices that prioritize training efficiency alone and propose theoretically principled adaptive batch size schedules compatible with both data parallelism and model parallelism. The schedules are implemented with PyTorch Fully Sharded Data Parallel and empirically shown to outperform constant batch sizes and heuristic warmup schedules in pretraining Llama 2 family models, particularly smaller ones with up to 3 billion parameters. Theoretical convergence guarantees are also established for these adaptive batch size schedules with Adam on general smooth nonconvex objectives. (An illustrative code sketch follows the summaries below.)

Low Difficulty Summary (GrooveSquid.com, original content)
The paper is about finding a good way to pick batch sizes when training large language models without wasting computing resources. Right now, people often choose batch sizes that make their computers run faster rather than produce better models. The authors think this is a bad trade-off and propose a new way to adjust the batch size during training. They show that their approach gives better results in practice, which matters because it could help people train even bigger and more powerful language models.
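
Illustrative Code Sketch

To make the idea of an adaptive batch size schedule concrete, below is a minimal, single-process sketch that grows the effective batch size during training via gradient accumulation with a plain Adam optimizer. The doubling rule, the toy model, and the synthetic data are illustrative assumptions, not the schedule proposed in the paper, and the PyTorch Fully Sharded Data Parallel wrapping used in the paper's experiments is omitted for brevity.

# Minimal sketch: grow the effective batch size during training via
# gradient accumulation. The doubling rule, model, and data are
# illustrative placeholders, not the paper's proposed schedule.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(128, 1)          # toy stand-in for a language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

micro_batch_size = 8               # samples processed per backward pass
accum_steps = 1                    # effective batch = micro_batch_size * accum_steps
max_accum_steps = 16               # cap on the effective batch size


def maybe_grow(step: int, accum: int) -> int:
    """Illustrative rule: double the effective batch size every 100 steps."""
    if step > 0 and step % 100 == 0:
        return min(accum * 2, max_accum_steps)
    return accum


for step in range(500):
    optimizer.zero_grad()
    for _ in range(accum_steps):
        x = torch.randn(micro_batch_size, 128)   # placeholder inputs
        y = torch.randn(micro_batch_size, 1)     # placeholder targets
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()                          # gradients accumulate across micro-batches
    optimizer.step()
    accum_steps = maybe_grow(step, accum_steps)

In a data-parallel setting, the same effect can be obtained by increasing the number of micro-batches processed per optimizer step across workers, which is one way a growing batch size can coexist with data and model parallelism.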

Keywords

» Artificial intelligence  » Generalization  » Llama  » Pretraining