


Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

by Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani

First submitted to arXiv on: 5 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Optimization and Control (math.OC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates how the behavior of adaptive gradient optimizers such as Adam(W) changes when the square root is removed from their diagonal preconditioner, which strengthens their second-order motivation. The authors find that these square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures while maintaining performance on transformers. The second-order perspective also enables the development of non-diagonal methods that incorporate arbitrary curvature approximations through the concept of preconditioner invariance. Notably, the root-free counterparts work well and efficiently in half precision, as they avoid numerically unstable matrix root decompositions and inversions. The findings provide new insights into the development of adaptive methods and raise questions about the overlooked role of the square root in their success.
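To make the contrast concrete, here is a minimal sketch of where the square root appears in a diagonal adaptive update and what dropping it changes. This is an illustrative Adam-style toy example with hypothetical names (e.g. `diagonal_adaptive_step`, `use_root`), not the authors' actual algorithm or derivation.

```python
import numpy as np

def diagonal_adaptive_step(param, grad, m, v, lr=1e-3, beta1=0.9,
                           beta2=0.999, eps=1e-8, use_root=True):
    """One toy diagonal adaptive update; `use_root` toggles between the
    usual square-root preconditioner and a root-free variant."""
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    if use_root:
        # Root-based (Adam-style): precondition by sqrt(v).
        denom = np.sqrt(v) + eps
    else:
        # Root-free: precondition by v itself, closer in spirit to a
        # second-order / natural-gradient style update.
        denom = v + eps
    return param - lr * m / denom, m, v
```

The same distinction matters numerically for non-diagonal (matrix) preconditioners: root-free updates avoid matrix root decompositions and inversions, which is why the paper reports that they behave well in half precision.
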
Low Difficulty Summary (original content by GrooveSquid.com)
This paper looks at how some algorithms that help train deep learning models behave when one step of their recipe (taking a square root) is left out. The authors found that these changed algorithms work better on models that process images and still do well on other kinds of models. This way of thinking also helps create new, more efficient training methods. Surprisingly, the approach works well even with less precise calculations, which makes the whole process faster and more practical.

Keywords

* Artificial intelligence
* Deep learning
* Generalization
* Precision