Summary of Stacking as Accelerated Gradient Descent, by Naman Agarwal, Pranjal Awasthi, Satyen Kale, and Eric Zhao
Stacking as Accelerated Gradient Descent by Naman Agarwal, Pranjal Awasthi, Satyen Kale, Eric Zhao. First submitted to…
Directional Smoothness and Gradient Methods: Convergence and Adaptivity by Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron…
On the Origins of Linear Representations in Large Language Models by Yibo Jiang, Goutham Rajendran, Pradeep…
Inverse-Free Fast Natural Gradient Descent Method for Deep Learning by Xinwei Ou, Ce Zhu, Xiaolin Huang,…
Level Set Teleportation: An Optimization Perspective by Aaron Mishkin, Alberto Bietti, Robert M. Gower. First submitted to…
How Well Can Transformers Emulate In-context Newton’s Method? by Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris…
SOFIM: Stochastic Optimization Using Regularized Fisher Information Matrix by Mrinmay Sen, A. K. Qin, Gayathri C,…
Noise misleads rotation invariant algorithms on sparse targets by Manfred K. Warmuth, Wojciech Kotłowski, Matt Jones,…
Enhancing LLM Safety via Constrained Direct Preference Optimization by Zixuan Liu, Xiaolin Sun, Zizhan Zheng. First submitted…
From Zero to Hero: How local curvature at artless initial conditions leads away from bad…