Summary of The AdEMAMix Optimizer: Better, Faster, Older, by Matteo Pagliardini et al.
The AdEMAMix Optimizer: Better, Faster, Older
by Matteo Pagliardini, Pierre Ablin, David Grangier
First submitted to arXiv on: 5 Sep 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The proposed AdEMAMix optimizer is a modification of Adam that accumulates past gradients with momentum. Whereas Adam relies on a single Exponential Moving Average (EMA), which cannot give high weight to the immediate past while still keeping much older gradients relevant, AdEMAMix mixes two EMAs, one fast and one slow, to make better use of past gradients (a rough sketch of the update appears after this table). Experiments on language modeling and image classification show that gradients can remain relevant for tens of thousands of steps, leading to faster convergence and lower minima. For instance, an AdEMAMix large language model (LLM) trained on 101 billion tokens performs comparably to an AdamW model trained on 197 billion tokens, and the method noticeably slows model forgetting during training. |
Low | GrooveSquid.com (original content) | AdEMAMix is a new way of using past gradients to help train machine learning models. Instead of only remembering the most recent gradients, AdEMAMix keeps two running averages: one that tracks recent gradients and one that holds on to much older ones. This helps the model learn better and forget less over time. Tests show that the method makes models converge faster and reach lower "minima" (deeper valleys in the loss landscape). For example, a large language model trained with AdEMAMix did as well as one trained with AdamW while using roughly half the training data. |
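To make the two-EMA idea concrete, here is a minimal NumPy sketch of an Adam-style step that mixes a fast and a slow gradient EMA, following the description in the medium summary. The hyperparameter names (`beta1`, `beta2`, `beta3`, `alpha`), the bias-correction choices, and the omission of the paper's schedulers are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a two-EMA, Adam-style update (illustrative, not the
# paper's exact algorithm; hyperparameter values and bias correction are
# assumptions for this example).
import numpy as np

def ademamix_step(theta, grad, state, lr=1e-3,
                  beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One optimizer step mixing a fast and a slow EMA of the gradients."""
    state["t"] += 1
    t = state["t"]

    # Fast EMA: tracks the immediate past (like Adam's first moment).
    state["m_fast"] = beta1 * state["m_fast"] + (1 - beta1) * grad
    # Slow EMA: large beta3 keeps much older gradients relevant.
    state["m_slow"] = beta3 * state["m_slow"] + (1 - beta3) * grad
    # Second moment, as in Adam.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2

    m_fast_hat = state["m_fast"] / (1 - beta1 ** t)  # bias-corrected fast EMA
    v_hat = state["v"] / (1 - beta2 ** t)            # bias-corrected 2nd moment

    # Combine both EMAs; alpha weights the contribution of the slow one.
    update = (m_fast_hat + alpha * state["m_slow"]) / (np.sqrt(v_hat) + eps)
    return theta - lr * update - lr * weight_decay * theta

# Usage: state = {"t": 0, "m_fast": 0.0, "m_slow": 0.0, "v": 0.0}
# theta = ademamix_step(theta, grad, state)
```

The slow EMA is what lets gradients from tens of thousands of steps ago keep influencing the update, while the fast EMA keeps the optimizer responsive to the most recent gradients.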
Keywords
- Artificial intelligence
- Image classification
- Language model
- Large language model
- Machine learning