
Summary of Breaking MLPerf Training: A Case Study on Optimizing BERT, by Yongdeok Kim et al.


Breaking MLPerf Training: A Case Study on Optimizing BERT

by Yongdeok Kim, Jaehyung Ahn, Myeongwoo Kim, Changin Choi, Heejae Kim, Narankhuu Tuvshinjargal, Seungwon Lee, Yanzi Zhang, Yuan Pei, Xiongzhan Linghu, Jingkun Ma, Lin Chen, Yuehua Dai, Sungjoo Yoo

First submitted to arXiv on: 4 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper presents novel approaches for speeding up large-scale distributed training of BERT models by improving load balancing, communication, and optimizers. By addressing each component individually, the authors achieve record-setting MLPerf BERT training performance. Load balancing matters because sample lengths vary widely, while communication cost is minimized by hiding it behind useful computation. Optimizers such as ADAM and LAMB are re-evaluated for large-scale distributed training. The authors propose local presorting based on dataset stratification for load balancing, and bucket-wise gradient clipping before allreduce, which allows gradient computation and synchronization to overlap (see the sketch below). They also tune the hyperparameters of existing optimizers, favoring ADAM with larger batch sizes. The combined approach yields the fastest MLPerf BERT training of 25.1 (22.3) seconds on 1,024 NVIDIA A100 GPUs, outperforming the top submissions to MLPerf v1.1 and v2.0.
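
To make the bucket-wise clipping idea concrete, here is a minimal PyTorch-style sketch, assuming gradients are already grouped into buckets. The helper name, the per-bucket clipping threshold, and the pre-division for averaging are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.distributed as dist

def clip_and_allreduce_bucket(bucket_params, max_norm, world_size):
    """Clip one bucket of gradients locally, then launch an async all-reduce.

    Hypothetical helper: bucket boundaries, max_norm, and the averaging
    convention are illustrative, not taken from the paper.
    """
    grads = [p.grad for p in bucket_params if p.grad is not None]
    if not grads:
        return []
    # Per-bucket norm instead of a global norm, so communication can start
    # before the rest of the backward pass has produced all gradients.
    bucket_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    clip_coef = (max_norm / (bucket_norm + 1e-6)).clamp(max=1.0)
    handles = []
    for g in grads:
        g.mul_(clip_coef / world_size)                      # clip and pre-divide for averaging
        handles.append(dist.all_reduce(g, async_op=True))   # overlaps with remaining backward
    return handles

# After the backward pass finishes, wait for all outstanding all-reduces
# before calling optimizer.step():
#     for handle in handles:
#         handle.wait()
```

Because each bucket is clipped against its own norm, the all-reduce for a bucket can be launched as soon as its gradients are ready (for example, from a backward hook), whereas conventional global-norm clipping has to wait for the entire backward pass before any clipped gradient can be sent.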
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper makes it faster to train big language models like BERT. It finds ways to make the training process more efficient by improving how data is distributed among many computers, reducing communication costs, and choosing the best optimizer for the job. The authors also come up with new ideas to sort data before processing it and to clip gradients before sharing them between machines. These innovations lead to a big speed boost in BERT training.
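
As an illustration of the load-balancing idea, the following Python sketch stratifies a local data shard by sequence length so that every batch draws samples of similar length; the function names, number of strata, and shuffling scheme are assumptions for illustration, not the paper's exact recipe.

```python
import random

def presort_shard(shard, num_strata, seed=0):
    """Stratify a local shard of tokenized samples by sequence length.

    shard: list of samples, each a sequence of token ids (anything with len()).
    Returns a list of strata, each holding samples of similar length.
    """
    if not shard:
        return []
    ordered = sorted(shard, key=len)                       # local presort by length
    stratum_size = (len(ordered) + num_strata - 1) // num_strata
    strata = [ordered[i:i + stratum_size]
              for i in range(0, len(ordered), stratum_size)]
    # Shuffle within each stratum: batches stay randomized, lengths stay similar.
    rng = random.Random(seed)
    for stratum in strata:
        rng.shuffle(stratum)
    return strata

def batches_from_strata(strata, batch_size):
    """Yield batches whose samples come from the same length stratum."""
    for stratum in strata:
        for i in range(0, len(stratum), batch_size):
            yield stratum[i:i + batch_size]
```

Because every rank builds its batches from comparable-length strata, the per-step work is similar across GPUs, so faster ranks spend less time waiting at gradient synchronization.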

Keywords

* Artificial intelligence
* BERT
* Hyperparameter