Summary of Reconciling Kaplan and Chinchilla Scaling Laws, by Tim Pearce et al.
Reconciling Kaplan and Chinchilla Scaling Laws
by Tim Pearce, Jinyeop Song
First submitted to arXiv on: 12 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The study compares the scaling behavior of transformers trained on next-token language prediction, analyzing the relationship between model size, training data, and compute budget. Kaplan et al. (2020) and Hoffmann et al. (2022) proposed different optimal scaling coefficients for achieving minimal loss under a given compute budget. This paper investigates the discrepancy and attributes it to Kaplan et al. counting non-embedding rather than total parameters, combined with performing their fits at small scale. Simulating the Chinchilla study under those conditions reproduces biased scaling coefficients close to Kaplan's, thereby reaffirming Chinchilla's coefficients. The paper also explains differences in the reported loss-compute relationships and recommends that future scaling studies use total parameters and total compute (a rough parameter-count illustration follows this table). |
Low | GrooveSquid.com (original content) | This paper compares how big models for language prediction get better with more training data or computer power. Two earlier papers said different things about how many model parts (parameters) are needed to make a good model. This new study figured out why those two papers didn't agree, and it turns out that one of the papers was counting some things incorrectly. The paper also explains why we see different patterns when we look at how well models do on language tasks versus computer power. Overall, this research helps us understand what makes big language models work better or worse. |
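
The parameter-counting point is easy to see with a back-of-the-envelope calculation. The sketch below is not taken from the paper; it assumes a GPT-style decoder-only transformer, the usual rough count of about 12 · n_layers · d_model² non-embedding parameters, a GPT-2-sized vocabulary of 50,257 tokens, and tied input/output embeddings. Under those assumptions, embedding parameters dominate small models but become negligible at large scale, which is why fits against non-embedding parameters behave differently at small scale than fits against total parameters.

```python
# Rough illustration (assumed architecture, not figures from the paper):
# compare non-embedding vs. total parameter counts for GPT-style transformers
# of increasing size, and report what fraction of the total the embeddings make up.

def param_counts(n_layers: int, d_model: int, vocab_size: int = 50_257):
    """Approximate parameter counts for a decoder-only transformer."""
    # ~4*d^2 for attention projections + ~8*d^2 for the MLP, per layer
    non_embedding = 12 * n_layers * d_model ** 2
    # tied input/output token embedding assumed; positional embeddings ignored
    embedding = vocab_size * d_model
    return non_embedding, embedding, non_embedding + embedding

# Hypothetical model shapes, small to large
for n_layers, d_model in [(4, 256), (12, 768), (48, 1600), (96, 12_288)]:
    non_emb, emb, total = param_counts(n_layers, d_model)
    print(f"layers={n_layers:3d} d_model={d_model:6d}  "
          f"non-embedding={non_emb/1e6:10.1f}M  total={total/1e6:10.1f}M  "
          f"embedding share={emb/total:6.1%}")
```

In this toy calculation the smallest shape has embeddings accounting for the large majority of its parameters, while the largest has them well under one percent, so "non-embedding" and "total" counts only diverge meaningfully in the small-model regime where Kaplan-style fits were performed.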
Keywords
* Artificial intelligence
* Embedding
* Token