Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
by Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
First submitted to arXiv on: 1 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | The paper's original abstract, available on arXiv
Medium | GrooveSquid.com (original content) | The paper investigates how generative models pre-trained on web-scale data perform when trained on their own generated outputs. Earlier studies predicted that such models would suffer "model collapse," where performance degrades with each iteration until the model becomes useless. That prediction, however, rests on the assumption that new data replace old data over time. This paper instead asks what happens when data accumulate, and finds that replacing the original real data with synthetic data does lead to model collapse, but accumulating real and synthetic data alongside each other avoids it. This phenomenon is observed across various model sizes, architectures, and hyperparameters. The authors use an analytically tractable framework to explain why accumulating data avoids model collapse, showing that the test error has a finite upper bound independent of the number of iterations.
Low | GrooveSquid.com (original content) | Generative models are trained on huge amounts of data, but what happens when they're trained on their own generated outputs? The paper answers this question by studying how language models perform when pre-trained on text corpora. It finds that replacing the original real data with synthetic data leads to "model collapse," where performance gets worse and worse until the model is useless. But if real and synthetic data are accumulated together, the model doesn't collapse! This holds for different types of models and data too.
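The replace-vs-accumulate dynamic the summaries describe can be illustrated with a toy simulation (this is not the authors' code or their experimental setup, just a minimal sketch): repeatedly fit a 1-D Gaussian, sample synthetic data from the fit, and either replace the training pool with the synthetic samples or append them to it. Under replacement, the fitted standard deviation drifts toward zero over iterations (a simple form of model collapse); under accumulation, the original real data anchor the fit and the estimate stays near the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(accumulate: bool, iterations: int = 200, n: int = 20) -> float:
    """Repeatedly fit a 1-D Gaussian to data sampled from the previous fit.

    accumulate=True keeps all past (real + synthetic) data in the pool;
    accumulate=False replaces the pool with fresh synthetic samples.
    Returns the final fitted standard deviation (true value is 1.0).
    """
    data = rng.normal(0.0, 1.0, n)        # "real" data drawn from N(0, 1)
    mu, sigma = data.mean(), data.std()
    for _ in range(iterations):
        synthetic = rng.normal(mu, sigma, n)  # sample from the current model
        data = np.concatenate([data, synthetic]) if accumulate else synthetic
        mu, sigma = data.mean(), data.std()   # refit on the chosen pool
    return sigma

sigma_replace = simulate(accumulate=False)    # collapses toward 0
sigma_accumulate = simulate(accumulate=True)  # stays near the true sigma = 1
print(f"replace:    sigma = {sigma_replace:.4f}")
print(f"accumulate: sigma = {sigma_accumulate:.4f}")
```

The collapse under replacement mirrors the paper's point about the replace-data assumption: each refit loses a little variance, and with no real data retained, those losses compound across iterations. With accumulation, the real samples never leave the pool, which is the intuition behind the bounded test error in the paper's analytical framework.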
Keywords
- Artificial intelligence
- Synthetic data