Summary of Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective, by Kaiyue Wen et al.
Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
by Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma
First submitted to arXiv on: 7 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The Warmup-Stable-Decay (WSD) schedule is an approach to training language models that does not require fixing a compute budget in advance. Unlike the traditional cosine learning rate schedule, WSD uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely; at any point, one can branch off from this main branch with a rapidly decaying learning rate to produce a strong model (a rough sketch of the schedule follows the table). Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase and drops sharply during the decay phase. To explain this, the paper proposes a river valley loss landscape assumption, under which the stable phase undergoes large oscillations yet progresses swiftly along the river, while the decay phase damps those oscillations and reveals the true optimization progress. This picture is consistent with empirical observations and can emerge even from pretraining on simple datasets. Inspired by this theory, the authors introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch; it empirically outperforms WSD and Cyclic-Cosine at obtaining multiple language model checkpoints across various compute budgets. |
| Low | GrooveSquid.com (original content) | This paper introduces a new way to train language models without needing to decide how much computing power to use ahead of time. The method, called Warmup-Stable-Decay (WSD), lets the model keep training indefinitely at a steady learning rate, then branch off and finish training whenever desired. This approach produces an unusual-looking curve for measuring progress during training. The paper suggests that this happens because the model is moving through a "river valley" landscape: it bounces around a lot while still making fast progress along the valley, and only settles down to reveal how far it has come once training is wrapped up. This idea helps explain why WSD works well in practice. To make things even better, the authors came up with a new version called WSD-S that reuses earlier finishing runs, so a single training run can yield several finished models at different compute budgets, which is really useful. |
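
To make the schedule concrete, here is a minimal sketch of a WSD learning rate function, assuming a linear warmup, a constant stable phase, and a linear decay to zero once a branch point is chosen. The function name, argument names, and the linear decay shape are illustrative assumptions, not details taken from the paper.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps):
    """Illustrative Warmup-Stable-Decay (WSD) schedule.

    - Warmup: learning rate rises linearly from 0 to peak_lr.
    - Stable: learning rate stays at peak_lr; this "main branch"
      can in principle continue indefinitely.
    - Decay: once a compute budget is chosen, branch off at
      `decay_start` and decay rapidly (linearly to 0 here) over
      `decay_steps` to obtain a strong checkpoint.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr
    # Decay phase of the branched run.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr * (1.0 - progress)


# Example: peak LR 3e-4, 1k warmup steps, branch off at step 50k,
# decay over 5k steps to produce a checkpoint at step 55k.
for s in (0, 500, 1_000, 25_000, 50_000, 52_500, 55_000):
    print(s, wsd_lr(s, peak_lr=3e-4, warmup_steps=1_000,
                    decay_start=50_000, decay_steps=5_000))
```

With a schedule like this, plain WSD would branch off and decay from the main branch once per compute budget; the WSD-S variant described in the summaries instead continues training from the already-decayed checkpoint, so earlier decay phases are reused and only one branch is ever kept.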
Keywords
» Artificial intelligence » Language model » Optimization » Pretraining