Summary of "How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse", by Mohamed El Amine Seddik et al.
How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
by Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah
First submitted to arXiv on: 7 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper studies model collapse in language models: when new models are trained on synthetic data generated by previously trained models, performance deteriorates across generations. Using a statistical model, the authors characterize the impact of various recursive training scenarios and show that collapse cannot be avoided when training on synthetic data alone. When real and synthetic data are mixed, however, the analysis yields an estimate of the maximum proportion of synthetic data below which model collapse can be avoided (see the simulation sketch after this table). |
Low | GrooveSquid.com (original content) | Model collapse happens when new language models are trained on fake data produced by earlier models. The tails of the original distribution gradually disappear, so later models forget what the first model learned. The researchers use a statistical model to study how different training setups affect this. They found that training only on fake data always leads to collapse, but that mixing in real data gives a limit on how much fake data you can use while still avoiding collapse. |
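To make the mechanism concrete, here is a minimal, hypothetical simulation of recursive training on a categorical next-token distribution. It is not code from the paper: all names and parameters (`V`, `N`, `GENS`, `mix_real`) are illustrative assumptions, and the maximum-likelihood refit stands in for the paper's more general statistical model. The sketch shows the two regimes the summaries describe: tail loss under purely synthetic training, and a stabilized tail when real data is mixed in.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 1000    # vocabulary size (hypothetical)
N = 5000    # samples drawn per generation (hypothetical)
GENS = 20   # number of recursive training rounds (hypothetical)

# Zipf-like "true" next-token distribution with a long tail.
true_p = 1.0 / np.arange(1, V + 1)
true_p /= true_p.sum()

def fit(samples):
    """Maximum-likelihood (empirical frequency) estimate over the vocabulary."""
    counts = np.bincount(samples, minlength=V)
    return counts / counts.sum()

def recurse(mix_real=0.0):
    """Train each generation on samples from the previous model,
    optionally mixing in a fraction `mix_real` of real data."""
    p = fit(rng.choice(V, size=N, p=true_p))  # generation 0: real data only
    support = []
    for _ in range(GENS):
        n_real = int(mix_real * N)
        real = rng.choice(V, size=n_real, p=true_p)
        synth = rng.choice(V, size=N - n_real, p=p)
        p = fit(np.concatenate([real, synth]))
        support.append(int((p > 0).sum()))  # surviving vocabulary size
    return support

print("synthetic only:", recurse(mix_real=0.0))  # support shrinks each round
print("10% real data :", recurse(mix_real=0.1))  # real data replenishes the tail
```

With purely synthetic training, any token that fails to be sampled in one round gets probability zero and can never reappear, so the support shrinks monotonically: the "disappearing tails" of the low-difficulty summary. Mixing in even a modest fraction of real data lets tail tokens re-enter the training set, which is the intuition behind the paper's threshold on the amount of synthetic data.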
Keywords
* Artificial intelligence
* Statistical model
* Synthetic data