


How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

by Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah

First submitted to arxiv on: 7 Apr 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
This version is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
The paper investigates the phenomenon of model collapse in language models, in which new models trained on synthetic data generated by previously trained models suffer deteriorating performance. Using a statistical model, the study characterizes the impact of various recursive training scenarios and demonstrates that model collapse cannot be avoided when training solely on synthetic data. When real and synthetic data are mixed, however, the analysis yields an estimate of the maximal amount of synthetic data below which model collapse can be avoided.

Low Difficulty Summary (written by GrooveSquid.com; original content)
Model collapse happens when new language models are trained on fake data generated by earlier models. This makes the rare outcomes in the original distribution (its "tails") disappear, so future models forget parts of their initial training. The researchers want to understand this problem better, so they use a statistical model to see how different training scenarios affect it. They found that if you train only on fake data, collapse always happens. But if you mix real and fake data, there is a limit below which you can avoid collapse.

Keywords

* Artificial intelligence  * Statistical model  * Synthetic data