Language models scale reliably with over-training and on downstream tasks
by Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
First submitted to arXiv on: 13 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper and is written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract serves as the high-difficulty summary. |
| Medium | GrooveSquid.com (original content) | The research develops an approach to predicting the performance of large language models using smaller, more computationally efficient experiments. It addresses two limitations of current scaling laws: (1) over-training is often employed to reduce inference costs, whereas scaling laws typically assume compute-optimal training; and (2) scaling laws primarily predict loss on next-token prediction, whereas models are usually evaluated on downstream task performance. To overcome these limitations, the researchers create a testbed of 104 models with varying parameter counts and train them on three data distributions. They then fit scaling laws that extrapolate to both the over-trained and compute-optimal regimes, enabling predictions for larger model runs using significantly less compute. They also propose a power law relating language model perplexity to downstream task performance, which allows average top-1 error across tasks to be predicted at reduced computational cost (a minimal curve-fitting sketch follows this table). |
| Low | GrooveSquid.com (original content) | The study aims to develop a more accurate method for predicting the performance of large language models by addressing two key limitations of current scaling laws. The researchers create a testbed of 104 models trained on different data distributions and fit new scaling laws that predict performance even when models are over-trained rather than trained compute-optimally. This makes it possible to predict how well larger models will perform using much less computational power. They also propose a power law that shows how language model perplexity relates to downstream task performance, which helps predict the average error rate across different tasks at reduced computing cost. |
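To make the two-step prediction strategy described above concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than the paper’s actual setup: the functional forms (a power law in compute for loss, and an exponential decay in loss, i.e., a power law in perplexity, for downstream error), the data points, and the initial guesses are all made up for demonstration.

```python
# Illustrative sketch of scaling-law curve fitting as described in the summaries above.
# All functional forms, data points, and initial guesses are assumptions for
# demonstration; they are not the paper's exact parameterization or measurements.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: training compute (normalized units) and validation loss.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])   # e.g., multiples of some base FLOP budget
loss = np.array([3.90, 3.54, 3.25, 3.07, 2.93])

def loss_vs_compute(C, E, A, alpha):
    """Irreducible loss E plus a power-law reducible term in compute."""
    return E + A * C ** (-alpha)

loss_params, _ = curve_fit(loss_vs_compute, compute, loss, p0=[2.5, 1.5, 0.2], maxfev=10000)

# Hypothetical paired measurements: validation loss and average top-1 error over tasks.
avg_top1_error = np.array([0.77, 0.72, 0.68, 0.64, 0.61])

def error_vs_loss(L, eps, k, gamma):
    """Exponential decay in loss; since loss is log-perplexity, this is a power law in perplexity."""
    return eps - k * np.exp(-gamma * L)

err_params, _ = curve_fit(error_vs_loss, loss, avg_top1_error, p0=[0.9, 2.0, 1.0], maxfev=10000)

# Extrapolate: predict the loss of a much larger run, then its average downstream error.
big_compute = 1e4  # same normalized units as above
predicted_loss = loss_vs_compute(big_compute, *loss_params)
predicted_error = error_vs_loss(predicted_loss, *err_params)
print(f"predicted loss: {predicted_loss:.2f}, predicted avg top-1 error: {predicted_error:.2f}")
```

Under these assumptions, the two fitted curves are chained: the loss predicted for a large compute budget is fed into the loss-to-error fit to estimate average downstream top-1 error, mirroring the prediction strategy described in the summaries.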
Keywords
* Artificial intelligence
* Inference
* Language model
* Perplexity
* Scaling laws
* Token