Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

by Oleg Filatov, Jan Ebert, Jiangtao Wang, Stefan Kesselheim

First submitted to arXiv on: 8 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper explores the challenge of optimally scaling large language models (LLMs) by tuning hyperparameters such as the learning rate η and the batch size B. The authors investigate the relationship between these hyperparameters, particularly in the infinite data limit. They find that the optimal scaling behavior depends on the pretraining token budget T, the batch size B, and the critical batch size B_crit, which grows roughly in proportion to T. Furthermore, they demonstrate that the optimal batch size is positively correlated with B_crit, and their findings challenge the conventional view that B_crit depends solely on the loss value. The authors also examine the sensitivity of the loss to changes in the learning rate, finding that this sensitivity decreases with increasing T and stays constant under μP model scaling. An illustrative, hypothetical sketch of these scaling trends follows the summaries below.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper helps us understand how to make large language models train better by adjusting a few key settings. It shows that there is a connection between these settings and the amount of data we train on. The authors found that the best way to adjust these settings depends on things like the training data budget, the batch size, and a "critical" batch size beyond which larger batches stop paying off. They also looked at how sensitive the loss (or error) is to changes in the learning rate and found that it gets less sensitive as we use more data.
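
To make these scaling trends more concrete, here is a minimal illustrative sketch in Python. The functional forms, constants, and names used here (critical_batch_size, optimal_learning_rate, c, eta0, alpha, and the example batch size) are hypothetical placeholders chosen only to mirror the qualitative behavior described in the medium difficulty summary (B_crit growing with the token budget T, and the optimal learning rate depending on both T and on the batch size relative to B_crit); they are not formulas or fitted values from the paper.

```python
# Illustrative sketch only: the power-law forms and all constants below are
# hypothetical placeholders, not results reported by Filatov et al.

def critical_batch_size(tokens: float, c: float = 1e-6) -> float:
    """Assume the critical batch size B_crit grows proportionally to the
    pretraining token budget T (qualitative trend from the summary)."""
    return c * tokens


def optimal_learning_rate(tokens: float, batch_size: float,
                          eta0: float = 3e-3, alpha: float = 0.1) -> float:
    """Hypothetical surrogate for the optimal learning rate: it shrinks
    slowly as the token budget grows and is damped when the batch size is
    below the (assumed) critical batch size."""
    b_crit = critical_batch_size(tokens)
    return eta0 * (tokens / 1e9) ** (-alpha) * min(1.0, batch_size / b_crit)


if __name__ == "__main__":
    batch_size = 4e6  # example batch size in tokens (placeholder value)
    for tokens in (1e9, 1e10, 1e11):  # pretraining token budgets
        b_crit = critical_batch_size(tokens)
        eta = optimal_learning_rate(tokens, batch_size)
        print(f"T={tokens:.0e}  B_crit≈{b_crit:.2e} tokens  eta*≈{eta:.2e}")
```

Substituting the scaling rules actually reported in the paper would only require changing these two functions; the loop for sweeping token budgets stays the same.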

Keywords

  • Artificial intelligence
  • Pretraining
  • Token