
Summary of Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler, by Yikang Shen et al.


Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda

First submitted to arXiv on: 23 Aug 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper explores the optimal learning rate for language model pretraining, a challenging problem because of the complex correlations among the many hyperparameters involved. Although recent studies use small proxy models and small corpora for hyperparameter search, the zero-shot transferability of those settings from small to large corpora remains underexplored. The authors study the relationship between the optimal learning rate, batch size, and number of training tokens for the WSD (warmup-stable-decay) scheduler and find a power-law correlation that transfers across model sizes. Based on this, they propose a new Power scheduler that is agnostic to the number of training tokens and the batch size, and demonstrate its effectiveness when combined with Maximum Update Parameterization (muP). Their experiments show that a single set of hyperparameters achieves impressive performance regardless of model size, architecture, or training conditions.

Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper is about finding the best learning rate for training language models. Right now, it's hard to figure out what works best because many settings affect how well a model learns. Some researchers test different settings on smaller models and datasets, but we don't know whether those results carry over to bigger models. This paper looks at the relationship between some important variables (the learning rate, the batch size, and the amount of training data) for a specific type of scheduler called WSD. The authors find a pattern (a power law) in how these variables affect each other, and this pattern holds across different model sizes. They also propose a new way of adjusting the learning rate that doesn't depend on the size of the dataset or the batch. This new approach works well when combined with another technique called Maximum Update Parameterization (muP).

Keywords

» Artificial intelligence  » Hyperparameter  » Language model  » Pretraining  » Transferability  » Zero shot