Summary of Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World, by Joshua Kazdan et al.
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
by Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo
First submitted to arXiv on: 22 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper investigates the consequences of pretraining generative machine learning models on web-scale datasets that contain synthetic data generated by earlier models. The authors confirm that replacing all real data with purely synthetic data leads to “model collapse” across three generative model task-settings. However, they also demonstrate that accumulating synthetic data alongside the real data and training on the combined pool maintains stable performance, with no diverging test losses. They further propose a workflow in which real and synthetic data accumulate together but each successive generation of pretraining is constrained to a fixed-size subset of the accumulated data; in this setting, test loss degrades slowly across generations. The authors’ findings have implications for forecasting the behavior of future generative models and for assessing the value of synthetic data in different contexts. |
| Low | GrooveSquid.com (original content) | Generative machine learning models are like super-smart factories that can create new data. When these models are trained only on a huge amount of fake data, something called “model collapse” happens: the model becomes useless because it is only good at imitating other fake data. In this paper, researchers tested three ways to train generative models: replacing all real data with fake data, combining real and fake data together, and combining the data but training on fixed-size chunks each time. They found that the first method leads to model collapse, the second keeps the model stable and working well, and the third makes the model get slowly worse rather than collapse. This study is important because it helps us understand what will happen as future generative models keep training on each other's outputs. |
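The three training regimes in the summaries (replace, accumulate, accumulate-with-subsampling) can be illustrated with a deliberately tiny toy simulation. This is not the paper's actual experimental setup; it is a hedged sketch that fits a 1-D Gaussian "model" to data, samples synthetic data from the fit, and retrains across generations. All sample sizes, generation counts, and the random seed below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed for reproducibility

def fit(data):
    """'Train' the toy model: estimate the mean and std of the data."""
    return data.mean(), data.std()

def simulate(regime, n_real=20, n_gens=200, subset_size=40):
    """Run n_gens generations of retraining under one data regime."""
    real = rng.normal(0.0, 1.0, n_real)  # ground-truth data ~ N(0, 1)
    pool = real.copy()
    mu, sigma = fit(real)
    for _ in range(n_gens):
        synthetic = rng.normal(mu, sigma, n_real)  # sample from current model
        if regime == "replace":
            # Discard all previous data; train only on the newest synthetic data.
            train_set = pool = synthetic
        elif regime == "accumulate":
            # Keep real data plus every generation's synthetic data.
            train_set = pool = np.concatenate([pool, synthetic])
        elif regime == "accumulate-subsample":
            # Accumulate everything, but train on a fixed-size random subset.
            pool = np.concatenate([pool, synthetic])
            train_set = rng.choice(pool, size=min(subset_size, len(pool)),
                                   replace=False)
        mu, sigma = fit(train_set)
    return sigma  # fitted std after the final generation

for regime in ("replace", "accumulate", "accumulate-subsample"):
    print(f"{regime:22s} final fitted std: {simulate(regime):.4f}")
```

Running this, the "replace" regime's fitted standard deviation shrinks toward zero over generations (the toy analogue of model collapse), while the "accumulate" regime stays close to the true value of 1, mirroring the qualitative behavior the summaries describe.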
Keywords
» Artificial intelligence » Generative model » Machine learning » Pretraining » Synthetic data