
Summary of Reconciling Kaplan and Chinchilla Scaling Laws, by Tim Pearce et al.


Reconciling Kaplan and Chinchilla Scaling Laws

by Tim Pearce, Jinyeop Song

First submitted to arXiv on: 12 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The study examines the scaling behavior of transformers trained on next-token language prediction, analyzing how model size, training data, and compute budget relate to loss. The original studies by Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") reported substantially different coefficients for how a model should optimally be scaled to achieve minimal loss under a given compute budget. This paper investigates that discrepancy and attributes much of it to Kaplan et al. counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under those conditions produces biased scaling coefficients close to Kaplan's, which reaffirms Chinchilla's coefficients. The paper also notes differences in the reported loss-compute relationships and recommends that future scaling studies use total parameters and compute.
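For context, the disagreement can be written as a difference in the exponent of a compute-optimal power law. The exponent values below are the figures commonly cited from the two original papers, not numbers given in this summary, so treat them as an illustrative sketch rather than results of this study:

$$
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad a_{\text{Kaplan}} \approx 0.73, \qquad a_{\text{Chinchilla}} \approx 0.50 .
$$

On the paper's account, counting total rather than non-embedding parameters, and fitting at larger scale, moves the measured exponent away from the Kaplan value and toward the Chinchilla value.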

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper looks at how big language-prediction models get better when given more training data or more computing power. Two earlier papers disagreed about how many model parts (parameters) are needed to make a good model for a given amount of computing. This new study works out why the two papers disagreed: it turns out one of them was counting the model's parts in a different way, which skewed its answer. The paper also explains why the two studies report different patterns between how well models do and how much computing power they use. Overall, this research helps us understand what makes big language models work better or worse.

Keywords

  • Artificial intelligence
  • Embedding
  • Token