Summary of 4+3 Phases of Compute-Optimal Neural Scaling Laws, by Elliot Paquette et al.
4+3 Phases of Compute-Optimal Neural Scaling Laws
by Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington
First submitted to arXiv on: 23 May 2024
Categories
- Main: Machine Learning (stat.ML)
- Secondary: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The authors propose a neural scaling model and use it to study the compute-limited, infinite-data scaling-law regime. The model has three parameters: data complexity, target complexity, and model-parameter-count. It is trained by running one-pass stochastic gradient descent on a mean-squared loss. The authors derive an expression for the loss curves that holds for all iteration counts and becomes more accurate as the model-parameter-count grows, and from it they determine the compute-optimal model-parameter-count as a function of the floating-point-operation (FLOP) budget (a minimal illustrative sketch of this setup follows the table). |
| Low | GrooveSquid.com (original content) | The neural scaling model helps researchers understand how to size models when data is effectively unlimited but the compute budget is not. The model has three main factors: how complex the data is, how complex the target (what we are trying to learn) is, and how many parameters the model uses. The model is trained with an algorithm called one-pass stochastic gradient descent, and the authors' description of the resulting loss curves becomes more accurate as the number of model parameters grows. The study shows that there are four main phases (plus three sub-phases) in how the compute-optimal model behaves depending on these factors, and it provides mathematical proofs and examples to support this. |
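To make the medium-difficulty description more concrete, here is a minimal, hypothetical sketch of the kind of setup it describes: a model with a data-complexity exponent, a target-complexity exponent, and a parameter count, trained by one-pass stochastic gradient descent on a mean-squared loss. The exponents `alpha` and `beta`, the dimensions, and the random-features construction below are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

# Illustrative sketch only (assumed setup, not the paper's exact model):
# a power-law random-features regression with three knobs --
#   alpha : data complexity (decay of the data spectrum)
#   beta  : target complexity (decay of the target coefficients)
#   d     : model-parameter-count
# trained by one-pass SGD on a mean-squared loss.

rng = np.random.default_rng(0)

v = 2000                  # ambient feature dimension
d = 200                   # model-parameter-count
alpha, beta = 1.0, 0.5    # hypothetical complexity exponents

eigs = np.arange(1, v + 1, dtype=float) ** (-alpha)    # data spectrum
target = np.arange(1, v + 1, dtype=float) ** (-beta)   # target coefficients
target /= np.linalg.norm(target)

W = rng.normal(size=(d, v)) / np.sqrt(v)   # fixed random projection
theta = np.zeros(d)                        # the d trainable parameters

def sample_batch(n):
    """Fresh samples at every step, i.e. one-pass / infinite-data training."""
    x = rng.normal(size=(n, v)) * np.sqrt(eigs)
    y = x @ target
    return x, y

lr, steps, batch = 0.1, 5000, 16
for _ in range(steps):
    x, y = sample_batch(batch)
    feats = x @ W.T                           # project to the d-dimensional model
    resid = feats @ theta - y                 # prediction error
    theta -= lr * feats.T @ resid / batch     # SGD step on the mean-squared loss

x, y = sample_batch(10_000)
loss = np.mean(((x @ W.T) @ theta - y) ** 2)
print(f"final mean-squared loss with d={d} parameters: {loss:.4f}")
```

Sweeping `d` while holding a rough compute budget fixed (steps × batch × per-step cost, which grows with `d`) would mimic the paper's question of which model-parameter-count is compute-optimal for a given FLOP budget.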
Keywords
» Artificial intelligence » Stochastic gradient descent