Summary of Towards Quantifying the Preconditioning Effect of Adam, by Rudrajit Das et al.
Towards Quantifying the Preconditioning Effect of Adam
by Rudrajit Das, Naman Agarwal, Sujay Sanghavi, Inderjit S. Dhillon
First submitted to arXiv on: 11 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Adam’s preconditioning effect relative to gradient descent (GD) has long been an open question in optimization. This paper provides a detailed analysis of Adam’s behavior on quadratic functions, showing that it can alleviate the curse of ill-conditioning at the expense of a dimension-dependent quantity. Concretely, the condition-number-like quantity controlling Adam’s iteration complexity is O(min(d, κ)) for diagonal Hessians and O(min(d√(dκ), κ)) for diagonally dominant Hessians, where d is the dimension and κ the condition number of the Hessian. Since GD’s complexity scales with κ, Adam outperforms GD whenever d < O(κ^p), with p = 1 for diagonal Hessians and p = 1/3 for diagonally dominant ones (note that d√(dκ) < κ exactly when d < κ^(1/3)). On the other hand, the analysis also exhibits scenarios where Adam is worse than GD even when d ≪ O(κ^(1/3)). Empirical evidence corroborates these findings. The paper further extends its results to functions satisfying per-coordinate Lipschitz smoothness and a modified version of the Polyak-Łojasiewicz condition. A small illustrative sketch of the GD-versus-Adam comparison appears after the table. |
| Low | GrooveSquid.com (original content) | Adam’s preconditioning effect on gradient descent (GD) is analyzed in this study. It shows how Adam can help with quadratic functions, but it also has some downsides. The main finding is that Adam gets better or worse than GD depending on the shape of the problem and the number of dimensions. This means that sometimes Adam is a good choice, but other times GD might be better. The research also looks at special cases where the function is smooth and easy to optimize. |
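To make the GD-versus-Adam comparison concrete, here is a minimal Python sketch (not code from the paper): it runs plain GD and Adam without momentum (β1 = 0) on an ill-conditioned diagonal quadratic. The Hessian spectrum, step sizes, β2, and iteration budget are illustrative assumptions; the coordinate-wise rescaling by 1/√v̂ is the preconditioning mechanism the summary refers to.

```python
import numpy as np

# Minimal, self-contained sketch (not the paper's experimental setup):
# compare plain GD and Adam without momentum (beta1 = 0) on an
# ill-conditioned diagonal quadratic f(x) = 0.5 * x^T diag(h) x.

def loss_and_grad(h, x):
    """Quadratic loss 0.5 * sum(h * x^2) and its gradient h * x."""
    return 0.5 * np.sum(h * x**2), h * x

def run_gd(h, x0, lr, n_iters):
    """Plain gradient descent with a constant step size."""
    x = x0.copy()
    for _ in range(n_iters):
        _, g = loss_and_grad(h, x)
        x -= lr * g
    return loss_and_grad(h, x)[0]

def run_adam_no_momentum(h, x0, lr, n_iters, beta2=0.99, eps=1e-12):
    """Adam with beta1 = 0: each coordinate is rescaled by 1/sqrt(v_hat),
    where v is an exponential moving average of squared gradients."""
    x = x0.copy()
    v = np.zeros_like(x)
    for t in range(1, n_iters + 1):
        _, g = loss_and_grad(h, x)
        v = beta2 * v + (1 - beta2) * g**2
        v_hat = v / (1 - beta2**t)                 # bias correction
        x -= lr * g / (np.sqrt(v_hat) + eps)
    return loss_and_grad(h, x)[0]

if __name__ == "__main__":
    d, kappa, n_iters = 10, 1e6, 5000              # small d, large condition number
    h = np.logspace(-np.log10(kappa), 0, d)        # eigenvalues spread over [1/kappa, 1]
    x0 = np.random.default_rng(0).standard_normal(d)

    # lr = 1/L with L = max eigenvalue = 1 is the standard stable GD step size.
    print("GD   final loss:", run_gd(h, x0, lr=1.0, n_iters=n_iters))
    # Adam's coordinate-wise normalization lets one absolute step size serve
    # both large- and small-curvature directions.
    print("Adam final loss:", run_adam_no_momentum(h, x0, lr=1e-3, n_iters=n_iters))
```

With a small dimension d and a large condition number κ, the adaptively rescaled method typically reaches a lower loss within the same budget on this kind of quadratic, mirroring the d-versus-κ trade-off described in the summary.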
Keywords
* Artificial intelligence
* Gradient descent
* Optimization