
Summary of Investigating Data Contamination for Pre-training Language Models, by Minhao Jiang et al.


Investigating Data Contamination for Pre-training Language Models

by Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo

First submitted to arxiv on: 11 Jan 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates the impact of data contamination on pre-trained language models’ performance on downstream tasks. Data contamination occurs when evaluation data is included in the pre-training corpus, artificially inflating benchmark performance. The authors pre-train GPT-2 models from scratch and inject contamination from evaluation data to study its effects, distinguishing text contamination (the evaluation inputs alone) from ground-truth contamination (the inputs together with their target answers). They also show that current n-gram-based definitions of contamination are inadequate for capturing these effects. The findings offer new insights into how data contamination shapes language model capabilities and highlight the need for independent contamination assessments in LLM studies.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper looks at how the data used to pre-train language models affects their performance on other tasks. Sometimes evaluation datasets end up in the training data, which can make a model look better than it really is. The authors train GPT-2 models from scratch and deliberately add contaminated data to see how it changes the results. They study two types of contamination: when only the evaluation text appears in the training data, and when the desired answers are included as well. Their findings help us understand how language models behave and why we need to be careful about the data used to train and evaluate them.
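The paper argues that n-gram overlap is an inadequate definition of contamination. As a rough illustration only (not the authors' code), the sketch below shows the kind of n-gram-based check the paper critiques: an evaluation example is flagged as contaminated if any of its n-grams also appears in a pre-training document. The function names, the whitespace tokenization, and the choice of n are all illustrative assumptions.

```python
# Illustrative sketch of an n-gram-based contamination check (not the paper's
# implementation). An evaluation example is flagged as contaminated when any
# of its n-grams also occurs in a pre-training document.

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text, pretraining_texts, n=8):
    """Flag an evaluation example if it shares any n-gram with the corpus."""
    eval_ngrams = ngrams(eval_text.split(), n)
    for doc in pretraining_texts:
        if eval_ngrams & ngrams(doc.split(), n):
            return True
    return False

# Hypothetical example: an evaluation question copied verbatim into training data.
corpus = ["question: what is the capital of france answer: paris"]
print(is_contaminated("what is the capital of france", corpus, n=5))  # True
```

A check like this only detects verbatim overlap; one point of the paper is that such surface-level definitions can miss contamination that still inflates downstream performance.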

Keywords

  • Artificial intelligence
  • GPT
  • Language model
  • N-gram