Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

by Thang Do, Sonja Hannibal, Arnulf Jentzen

First submitted to arXiv on: 14 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Numerical Analysis (math.NA); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High difficulty summary (written by the paper authors)
Read the original abstract here.

Medium difficulty summary (original content by GrooveSquid.com)
Deep learning methods, specifically deep neural networks (DNNs) trained by stochastic gradient descent (SGD) optimization, are crucial tools for solving data-driven supervised learning problems. Despite their success, it remains an open problem to rigorously explain their effectiveness and limitations. This paper tackles the question of whether SGD methods converge to global minimizers in the training of DNNs with rectified linear unit (ReLU) activation functions, and gives a negative answer. The authors prove that for a large class of SGD methods, including accelerated and adaptive variants such as momentum SGD, Nesterov accelerated SGD, Adagrad, RMSProp, Adam, Adamax, AMSGrad, and Nadam, the probability of not converging to a global minimizer increases exponentially in the width and depth of the DNN. This result has implications for understanding the convergence properties of popular deep learning methods; a toy experiment illustrating the question is sketched below, after the summaries.
Low difficulty summary (original content by GrooveSquid.com)
Deep learning is a powerful tool that helps computers learn from data. Despite its success, scientists don't fully understand why it works. The researchers proved that some common training methods don't always find the best possible solution, which is important to know when developing new AI models. The more complex the model, the less likely these methods are to find the best answer. This discovery can help improve AI and deep learning in general.
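
To make the non-convergence statement from the medium difficulty summary concrete, here is a minimal sketch (not taken from the paper; the architecture, task, and all hyperparameters are illustrative assumptions) of one way to probe whether training reaches a global minimizer: a tiny ReLU network with a known optimal empirical risk of zero is trained with Adam from several random initializations, and the fraction of runs whose final loss stays bounded away from zero is recorded.

```python
# Toy experiment (illustrative only, not the paper's construction):
# train a width-2 ReLU network on a realizable regression target f(x) = |x|,
# so the global minimum of the empirical risk is exactly 0, and count how many
# random initializations fail to get close to that optimum under Adam.
import torch

torch.manual_seed(0)

x = torch.linspace(-1.0, 1.0, 128).unsqueeze(1)
y = torch.abs(x)  # realizable by a width-2 ReLU network, optimal MSE = 0


def run_trial(width=2, steps=3000, lr=1e-2):
    model = torch.nn.Sequential(
        torch.nn.Linear(1, width),
        torch.nn.ReLU(),
        torch.nn.Linear(width, 1),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()


final_losses = [run_trial() for _ in range(20)]
stuck = sum(loss > 1e-3 for loss in final_losses)
print(f"{stuck}/20 runs ended with loss > 1e-3, i.e. away from the global minimum 0")
```

Depending on the random initialization, some runs may stall at a suboptimal limit value; the paper's theorems quantify, for a broad class of SGD-type and Adam-type optimizers, how likely such failures to reach a global minimizer are.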

Keywords

» Artificial intelligence  » Deep learning  » Optimization  » Probability  » ReLU  » Stochastic gradient descent  » Supervised