


Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

by Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani

First submitted to arXiv on: 5 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Optimization and Control (math.OC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates how the behavior of adaptive gradient optimizers such as Adam(W) changes when the square root is removed from their diagonal preconditioner, which strengthens their second-order motivation. The authors find that these square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures while maintaining performance on transformers. The second-order perspective also enables the development of non-diagonal methods that incorporate arbitrary curvature approximations through the concept of preconditioner invariance. Notably, the root-free counterparts work well and efficiently in half precision, as they avoid numerically unstable matrix root decompositions and inversions. The findings provide new insights into the development of adaptive methods and raise questions about the overlooked role of the square root in their success.
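To make the contrast concrete, here is a minimal sketch of where the square root appears in a diagonal adaptive update and what dropping it changes. This is an illustrative Adam-style toy example with hypothetical names (e.g. `diagonal_adaptive_step`, `use_root`), not the authors' actual algorithm or derivation.

```python
import numpy as np

def diagonal_adaptive_step(param, grad, m, v, lr=1e-3, beta1=0.9,
                           beta2=0.999, eps=1e-8, use_root=True):
    """One toy diagonal adaptive update; `use_root` toggles between the
    usual square-root preconditioner and a root-free variant."""
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    if use_root:
        # Root-based (Adam-style): precondition by sqrt(v).
        denom = np.sqrt(v) + eps
    else:
        # Root-free: precondition by v itself, closer in spirit to a
        # second-order / natural-gradient style update.
        denom = v + eps
    return param - lr * m / denom, m, v
```

The same distinction matters numerically for non-diagonal (matrix) preconditioners: root-free updates avoid matrix root decompositions and inversions, which is why the paper reports that they behave well in half precision.
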
Low Difficulty Summary (original content by GrooveSquid.com)
This paper looks at how some algorithms that help train deep learning models behave when one step of their recipe (taking a square root) is left out. The authors found that these changed algorithms work better on models that process images and still do well on other kinds of models. This way of thinking also helps create new, more efficient training methods. Surprisingly, the approach works well even with less precise calculations, which makes the whole process faster and more practical.

Keywords

* Artificial intelligence
* Deep learning
* Generalization
* Precision