
Summary of The AdEMAMix Optimizer: Better, Faster, Older, by Matteo Pagliardini et al.


The AdEMAMix Optimizer: Better, Faster, Older

by Matteo Pagliardini, Pierre Ablin, David Grangier

First submitted to arXiv on: 5 Sep 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed AdEMAMix optimizer is a modified version of the Adam optimizer that leverages momentum-based accumulation of past gradients. Whereas Adam relies on a single Exponential Moving Average (EMA) of gradients, which cannot give substantial weight to recent gradients while also retaining information from much older ones, AdEMAMix uses a mixture of two EMAs to better take advantage of past gradients. Experimental results on language modeling and image classification show that gradients can remain relevant for tens of thousands of steps, leading to faster convergence and lower minima. For instance, an AdEMAMix large language model (LLM) trained on 101 billion tokens performs comparably to an AdamW model trained on 197 billion tokens, with a significant reduction in model forgetting during training. A minimal sketch of the two-EMA update follows the summaries below.
Low Difficulty Summary (written by GrooveSquid.com, original content)
AdEMAMix is a new way of using past gradients to help train machine learning models. Instead of just remembering the most recent gradients, AdEMAMix uses two different ways to remember older gradients too. This helps the model learn better and forget less over time. Tests show that this method can make models converge faster and find lower “minima” (the bottom of a valley in the graph). For example, one big language model trained with AdEMAMix did as well as another one trained with a different method, but used much less data.
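
To make the "mixture of two EMAs" idea concrete, here is a minimal sketch of how such an update could look. It is written from the summary above, not from the authors' code: the function name ademamix_step, its state dictionary, and the default values for alpha and beta3 are illustrative assumptions rather than the paper's reference implementation.

```python
# Hypothetical sketch of a two-EMA (AdEMAMix-style) update, based on the
# summary above; names and default hyperparameters are assumptions.
import numpy as np

def init_state(shape):
    return {"m_fast": np.zeros(shape), "m_slow": np.zeros(shape),
            "v": np.zeros(shape), "t": 0}

def ademamix_step(theta, grad, state, lr=1e-3,
                  beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One parameter update mixing a fast and a slow EMA of past gradients."""
    state["t"] += 1
    t = state["t"]

    # Fast EMA: reacts quickly to recent gradients (Adam-style first moment).
    state["m_fast"] = beta1 * state["m_fast"] + (1 - beta1) * grad
    # Slow EMA: decays very slowly, so gradients from many thousands of
    # steps ago still contribute to the update.
    state["m_slow"] = beta3 * state["m_slow"] + (1 - beta3) * grad
    # Second moment for the adaptive step size, as in Adam.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2

    # Bias-correct the fast EMA and the second moment.
    m_hat = state["m_fast"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Mix the two EMAs: alpha controls how strongly the old-gradient
    # memory contributes relative to the fast, recent-gradient term.
    update = (m_hat + alpha * state["m_slow"]) / (np.sqrt(v_hat) + eps)
    return theta - lr * (update + weight_decay * theta)
```

The key design point, per the summary, is that a single EMA cannot both weight recent gradients heavily and keep a non-negligible memory of very old ones; splitting the momentum into a fast and a slow EMA lets the optimizer do both at once.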

Keywords

  • Artificial intelligence
  • Image classification
  • Language model
  • Large language model
  • Machine learning