Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
by Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
First submitted to arXiv on: 29 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | This paper investigates why Adam outperforms gradient descent on large language models by a wider margin than on other tasks. The authors identify heavy-tailed class imbalance as a key factor: when training with gradient descent, the loss on infrequent words decreases more slowly than the loss on frequent ones, and since most samples come from relatively infrequent words, the average loss decreases slowly overall. Adam and sign-based methods are far less sensitive to this problem. The paper shows that this behavior can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models, and that class imbalance leads to imbalanced, correlated gradients and Hessians that benefit Adam. Finally, the authors prove that, in continuous time in a simple setting, gradient descent converges slowly on low-frequency classes while sign descent does not. A minimal illustrative sketch of this setup follows the table. |
| Low | GrooveSquid.com (original content) | This paper helps us understand why a machine learning algorithm called Adam is better at large language tasks than another algorithm called gradient descent. The issue seems to be that language tasks involve many uncommon words, which slows down training with gradient descent, while Adam and similar algorithms are much less affected. To show this, the researchers tested these algorithms on different kinds of models and data and found the same behavior in each case. This helps explain why Adam is better at some tasks than others. |
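
To make the mechanism in the medium-difficulty summary concrete, here is a minimal sketch, not the authors' code: it compares full-batch gradient descent with Adam on a linear softmax classifier trained on synthetic data whose class frequencies follow a heavy-tailed (Zipf-like) distribution, then reports the average loss on frequent versus rare classes. The dataset, model, and hyperparameters are illustrative assumptions; the point is only to expose the per-class loss gap the paper analyzes.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): compare
# full-batch gradient descent with Adam on a linear softmax classifier whose
# training data has heavy-tailed (Zipf-like) class frequencies, then report
# the average loss on frequent vs. rare classes.
import torch

torch.manual_seed(0)
num_classes, dim, num_samples = 100, 20, 5000

# Heavy-tailed class frequencies: p(k) proportional to 1 / k.
freqs = 1.0 / torch.arange(1, num_classes + 1, dtype=torch.float)
freqs /= freqs.sum()
labels = torch.multinomial(freqs, num_samples, replacement=True)

# Each class gets a random prototype; inputs are noisy copies of it.
prototypes = torch.randn(num_classes, dim)
inputs = prototypes[labels] + 0.5 * torch.randn(num_samples, dim)

frequent = labels < 10   # the 10 most frequent classes
rare = labels >= 10      # the heavy tail of infrequent classes

def train(optimizer_name, steps=500):
    model = torch.nn.Linear(dim, num_classes)
    if optimizer_name == "gd":
        # Plain (full-batch) gradient descent.
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(inputs), labels).mean().backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model(inputs), labels)

for name in ("gd", "adam"):
    losses = train(name)
    print(f"{name:5s} frequent-class loss: {losses[frequent].mean().item():.3f}  "
          f"rare-class loss: {losses[rare].mean().item():.3f}")
```

Under gradient descent, the loss on the rare classes would be expected to lag well behind the loss on the frequent classes, while Adam narrows that gap much faster, which is the qualitative behavior the paper describes.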
Keywords
* Artificial intelligence
* Gradient descent
* Machine learning