Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
by Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
First submitted to arXiv on: 29 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | This paper investigates why Adam outperforms gradient descent on large language models by a wider margin than on other tasks. The authors identify heavy-tailed class imbalance as a key factor: when training with gradient descent, the loss on infrequent words decreases more slowly than the loss on frequent ones, and since most samples come from relatively infrequent words, the average loss decreases slowly overall. Adam and sign-based methods are far less sensitive to this problem. The paper shows that this behavior can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models, and that class imbalance leads to imbalanced, correlated gradients and Hessians that benefit Adam. Finally, the authors prove that, in continuous time in a simple setting, gradient descent converges slowly on low-frequency classes while sign descent does not. A minimal illustrative sketch of this setup follows the table. |
| Low | GrooveSquid.com (original content) | This paper helps us understand why a machine learning algorithm called Adam is better at large language tasks than another algorithm called gradient descent. The issue seems to be that language tasks involve many uncommon words, which slows down training with gradient descent, while Adam and similar algorithms are much less affected. To show this, the researchers tested these algorithms on different kinds of models and data and found the same behavior in each case. This helps explain why Adam is better at some tasks than others. |
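
To make the mechanism in the medium-difficulty summary concrete, here is a minimal sketch, not the authors' code: it compares full-batch gradient descent with Adam on a linear softmax classifier trained on synthetic data whose class frequencies follow a heavy-tailed (Zipf-like) distribution, then reports the average loss on frequent versus rare classes. The dataset, model, and hyperparameters are illustrative assumptions; the point is only to expose the per-class loss gap the paper analyzes.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): compare
# full-batch gradient descent with Adam on a linear softmax classifier whose
# training data has heavy-tailed (Zipf-like) class frequencies, then report
# the average loss on frequent vs. rare classes.
import torch

torch.manual_seed(0)
num_classes, dim, num_samples = 100, 20, 5000

# Heavy-tailed class frequencies: p(k) proportional to 1 / k.
freqs = 1.0 / torch.arange(1, num_classes + 1, dtype=torch.float)
freqs /= freqs.sum()
labels = torch.multinomial(freqs, num_samples, replacement=True)

# Each class gets a random prototype; inputs are noisy copies of it.
prototypes = torch.randn(num_classes, dim)
inputs = prototypes[labels] + 0.5 * torch.randn(num_samples, dim)

frequent = labels < 10   # the 10 most frequent classes
rare = labels >= 10      # the heavy tail of infrequent classes

def train(optimizer_name, steps=500):
    model = torch.nn.Linear(dim, num_classes)
    if optimizer_name == "gd":
        # Plain (full-batch) gradient descent.
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(inputs), labels).mean().backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model(inputs), labels)

for name in ("gd", "adam"):
    losses = train(name)
    print(f"{name:5s} frequent-class loss: {losses[frequent].mean().item():.3f}  "
          f"rare-class loss: {losses[rare].mean().item():.3f}")
```

Under gradient descent, the loss on the rare classes would be expected to lag well behind the loss on the frequent classes, while Adam narrows that gap much faster, which is the qualitative behavior the paper describes.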
Keywords
* Artificial intelligence
* Gradient descent
* Machine learning