


How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

by Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah

First submitted to arxiv on: 7 Apr 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
This version is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
The paper investigates the phenomenon of model collapse in language models, in which new models trained on synthetic data generated by previously trained models suffer deteriorating performance. Using a statistical model, the study characterizes the impact of various recursive training scenarios and demonstrates that model collapse cannot be avoided when training solely on synthetic data. When real and synthetic data are mixed, however, the analysis yields an estimate of the maximal amount of synthetic data below which model collapse can be avoided.

Low Difficulty Summary (written by GrooveSquid.com; original content)
Model collapse happens when new language models are trained on fake data generated by earlier models. This makes the rare outcomes in the original distribution (its "tails") disappear, so future models forget parts of their initial training. The researchers want to understand this problem better, so they use a statistical model to see how different training scenarios affect it. They found that if you train only on fake data, collapse always happens. But if you mix real and fake data, there is a limit below which you can avoid collapse.

Keywords

* Artificial intelligence  * Statistical model  * Synthetic data