Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
by Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
First submitted to arXiv on: 1 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | The paper's original abstract, available on arXiv
Medium | GrooveSquid.com (original content) | The paper investigates how generative models pre-trained on web-scale data perform when trained on their own generated outputs. Earlier studies predicted that such models would suffer "model collapse," where performance degrades with each iteration until the model becomes useless. That prediction, however, rests on the assumption that new data replace old data over time. This paper instead asks what happens when data accumulate, and finds that replacing the original real data with synthetic data does lead to model collapse, but accumulating real and synthetic data alongside each other avoids it. This phenomenon is observed across various model sizes, architectures, and hyperparameters. The authors use an analytically tractable framework to explain why accumulating data avoids model collapse, showing that the test error has a finite upper bound independent of the number of iterations.
Low | GrooveSquid.com (original content) | Generative models are trained on huge amounts of data, but what happens when they're trained on their own generated outputs? The paper answers this question by studying how language models perform when pre-trained on text corpora. It finds that replacing the original real data with synthetic data leads to "model collapse," where performance gets worse and worse until the model is useless. But if real and synthetic data are accumulated together, the model doesn't collapse! This holds for different types of models and data too.
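The replace-vs-accumulate dynamic the summaries describe can be illustrated with a toy simulation (this is not the authors' code or their experimental setup, just a minimal sketch): repeatedly fit a 1-D Gaussian, sample synthetic data from the fit, and either replace the training pool with the synthetic samples or append them to it. Under replacement, the fitted standard deviation drifts toward zero over iterations (a simple form of model collapse); under accumulation, the original real data anchor the fit and the estimate stays near the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(accumulate: bool, iterations: int = 200, n: int = 20) -> float:
    """Repeatedly fit a 1-D Gaussian to data sampled from the previous fit.

    accumulate=True keeps all past (real + synthetic) data in the pool;
    accumulate=False replaces the pool with fresh synthetic samples.
    Returns the final fitted standard deviation (true value is 1.0).
    """
    data = rng.normal(0.0, 1.0, n)        # "real" data drawn from N(0, 1)
    mu, sigma = data.mean(), data.std()
    for _ in range(iterations):
        synthetic = rng.normal(mu, sigma, n)  # sample from the current model
        data = np.concatenate([data, synthetic]) if accumulate else synthetic
        mu, sigma = data.mean(), data.std()   # refit on the chosen pool
    return sigma

sigma_replace = simulate(accumulate=False)    # collapses toward 0
sigma_accumulate = simulate(accumulate=True)  # stays near the true sigma = 1
print(f"replace:    sigma = {sigma_replace:.4f}")
print(f"accumulate: sigma = {sigma_accumulate:.4f}")
```

The collapse under replacement mirrors the paper's point about the replace-data assumption: each refit loses a little variance, and with no real data retained, those losses compound across iterations. With accumulation, the real samples never leave the pool, which is the intuition behind the bounded test error in the paper's analytical framework.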
Keywords
- Artificial intelligence
- Synthetic data