Summary of Why Transformers Need Adam: A Hessian Perspective, by Yushun Zhang et al.
Why Transformers Need Adam: A Hessian Perspective
by Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo
First submitted to arXiv on: 26 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper and is written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | SGD struggles to keep pace with Adam on Transformer-based models, but why? This paper offers an explanation through the lens of Hessian matrices. It reveals that Transformers exhibit “block heterogeneity”: different parameter blocks have vastly different Hessian spectra. This heterogeneity hampers SGD, causing it to perform poorly compared with Adam. The authors validate their findings on various Transformer models, CNNs, and quadratic problems, demonstrating that SGD can match Adam when there is no block heterogeneity but falters when heterogeneity is present. An initial theoretical analysis suggests that a single learning rate shared across all blocks, as in SGD, is insufficient to handle the heterogeneity, making the coordinate-wise learning rates employed in Adam a plausible remedy (see the toy sketch after this table). |
| Low | GrooveSquid.com (original content) | SGD doesn’t work well on Transformer models, and we don’t fully know why. This paper explains what’s going on using special math tools called Hessian matrices. It turns out that Transformers are made of parts that learn at very different speeds, which makes it hard for SGD to keep up. The authors tested this idea on lots of different models and found that when all the parts behave similarly, SGD can do as well as Adam. But when the parts are mixed, as in a Transformer, SGD falls behind. They think this is because SGD uses one learning rate for everything, which isn’t enough to handle such different parts. |
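The following toy sketch (not from the paper; the curvature values 100.0 and 0.01 are illustrative assumptions) mimics the quadratic experiments described above: two parameter “blocks” with very different Hessian spectra. Gradient descent must pick one learning rate small enough for the sharp block, so the flat block barely moves, while Adam’s per-coordinate step sizes adapt to each block.

```python
# Minimal sketch, assuming a diagonal 2-block quadratic with heterogeneous curvature.
# Not the authors' code; it only illustrates why one shared learning rate struggles
# under block heterogeneity while coordinate-wise scaling (Adam) copes.
import numpy as np

hessian_diag = np.array([100.0, 0.01])  # two "blocks" with very different curvature

def loss(x):
    return 0.5 * np.sum(hessian_diag * x**2)

def grad(x):
    return hessian_diag * x

def run_gd(lr, steps=2000):
    # Plain gradient descent: one learning rate shared by both blocks.
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x -= lr * grad(x)
    return loss(x)

def run_adam(lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    # Textbook Adam: per-coordinate step sizes adapt to each block's gradient scale.
    x = np.array([1.0, 1.0])
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return loss(x)

# GD's shared rate must stay below 2/100 for stability, so the flat block (curvature 0.01)
# makes almost no progress; Adam rescales each coordinate and drives both blocks down.
print("GD  (lr=0.019):", run_gd(0.019))
print("Adam(lr=0.05) :", run_adam())
```

Running the script should show GD stalling at a noticeably higher loss than Adam on this heterogeneous quadratic, matching the paper's qualitative claim; when both curvatures are set equal (no heterogeneity), the gap largely disappears.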
Keywords
* Artificial intelligence
* Transformer