
Summary of Scaling Optimal LR Across Token Horizons, by Johan Bjorck et al.


Scaling Optimal LR Across Token Horizons

by Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song

First submitted to arXiv on: 30 Sep 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract of paper | PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)

The high difficulty version is the paper's original abstract; read it via the "Abstract of paper" link above.

Medium Difficulty Summary (GrooveSquid.com, original content)

State-of-the-art Large Language Models (LLMs) rely on scaling model size, dataset size, and cluster size. To avoid extensive hyperparameter tuning at full scale, practitioners transfer optimal hyperparameters from smaller experiments to larger ones. While hyperparameter transfer across model sizes has been studied, the relationship between the optimal learning rate (LR) and the token horizon (dataset size) in LLM training has remained unexplored. This study addresses that gap with a large-scale empirical investigation of how the optimal LR depends on the token horizon. The results show that the optimal LR changes significantly with the token horizon: longer training requires a smaller LR. The optimal LR also follows a scaling law, so the optimal LR for longer horizons can be estimated accurately from shorter ones. The study further provides a rule of thumb for transferring the LR across token horizons at no additional overhead, and finds that LLaMA-1 used an overly high LR, resulting in a performance hit. This work highlights hyperparameter transfer across dataset size as a crucial component of LLM training.
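The key technical claim in the medium summary is that the optimal LR follows a scaling law in the token horizon, so a fit on short training runs can be extrapolated to longer ones. The sketch below illustrates that idea under an assumed power-law form, optimal LR ~ a * D^(-b) in the number of training tokens D; the functional form, variable names, and sample numbers are illustrative assumptions, not the paper's actual procedure or results.

```python
# Hedged sketch: fit a power law LR*(D) = a * D**(-b) to optimal LRs measured
# at short token horizons, then extrapolate to a longer horizon.
# The power-law form, the example numbers, and the names are illustrative
# assumptions; they are not taken from the paper.
import numpy as np

# Hypothetical (token_horizon, optimal_LR) pairs from small LR sweeps.
horizons = np.array([1e9, 2e9, 4e9, 8e9])          # training tokens
optimal_lrs = np.array([6e-4, 4.8e-4, 3.8e-4, 3.0e-4])

# Fit log(LR) = log(a) - b * log(D) by least squares in log-log space.
slope, intercept = np.polyfit(np.log(horizons), np.log(optimal_lrs), deg=1)
a, b = np.exp(intercept), -slope

def predict_optimal_lr(tokens: float) -> float:
    """Extrapolate the fitted power law to a longer token horizon."""
    return a * tokens ** (-b)

print(f"fitted exponent b = {b:.3f}")
print(f"predicted optimal LR at 100B tokens: {predict_optimal_lr(1e11):.2e}")
```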
Low Difficulty Summary (GrooveSquid.com, original content)

This paper studies how to choose the right settings when training big language models. Usually, scientists tune many settings to get the best results, but at full scale this takes too much time and money. Instead, they can take what works well in small experiments and apply it to larger ones. The researchers looked at how one key setting, the "learning rate", should change as the training dataset gets bigger or smaller. They found that the learning rate needs to shrink as the dataset grows. This matters because big language models like LLaMA-1 may not have used the right learning rate, which means they are not performing as well as they could.

Keywords

» Artificial intelligence  » Hyperparameter  » Llama  » Token