
Summary of Investigating Data Contamination for Pre-training Language Models, by Minhao Jiang et al.


Investigating Data Contamination for Pre-training Language Models

by Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo

First submitted to arxiv on: 11 Jan 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates the impact of data contamination on pre-trained language models’ performance on downstream tasks. Data contamination occurs when evaluation data is included in the pre-training corpus, artificially inflating benchmark performance. The authors pre-train GPT-2 models from scratch and inject contamination from evaluation data to study its effects, distinguishing text contamination (the evaluation inputs alone) from ground-truth contamination (the inputs together with their target answers). They also show that current n-gram-based definitions of contamination are inadequate for capturing these effects. The findings offer new insights into how data contamination shapes language model capabilities and highlight the need for independent contamination assessments in LLM studies.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper looks at how the data used to pre-train language models affects their performance on other tasks. Sometimes evaluation datasets end up in the training data, which can make a model look better than it really is. The authors train GPT-2 models from scratch and deliberately add contaminated data to see how it changes the results. They study two types of contamination: when only the evaluation text appears in the training data, and when the desired answers are included as well. Their findings help us understand how language models behave and why we need to be careful about the data used to train and evaluate them.
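The paper argues that n-gram overlap is an inadequate definition of contamination. As a rough illustration only (not the authors' code), the sketch below shows the kind of n-gram-based check the paper critiques: an evaluation example is flagged as contaminated if any of its n-grams also appears in a pre-training document. The function names, the whitespace tokenization, and the choice of n are all illustrative assumptions.

```python
# Illustrative sketch of an n-gram-based contamination check (not the paper's
# implementation). An evaluation example is flagged as contaminated when any
# of its n-grams also occurs in a pre-training document.

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text, pretraining_texts, n=8):
    """Flag an evaluation example if it shares any n-gram with the corpus."""
    eval_ngrams = ngrams(eval_text.split(), n)
    for doc in pretraining_texts:
        if eval_ngrams & ngrams(doc.split(), n):
            return True
    return False

# Hypothetical example: an evaluation question copied verbatim into training data.
corpus = ["question: what is the capital of france answer: paris"]
print(is_contaminated("what is the capital of france", corpus, n=5))  # True
```

A check like this only detects verbatim overlap; one point of the paper is that such surface-level definitions can miss contamination that still inflates downstream performance.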

Keywords

  • Artificial intelligence
  • GPT
  • Language model
  • N-gram