
Summary of Data Contamination Report from the 2024 CONDA Shared Task, by Oscar Sainz et al.


Data Contamination Report from the 2024 CONDA Shared Task

by Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D’Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, Emily Silcock, Kateryna Solonko, David Stap, Mihai Surdeanu, Yu-Min Tseng, Vishaal Udandarao, Zengzhi Wang, Ruijie Xu, Jinglin Yang

First submitted to arXiv on: 31 Jul 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The CONDA 2024 workshop investigates data contamination in natural language processing, where pre-training corpora for large-scale models include evaluation data, compromising results. The shared task collects evidence on contaminated datasets and models to help researchers avoid reporting biased evaluations. A public database tracks contamination events, with contributions welcomed via GitHub. This paper presents the initial compilation of 566 reported entries across 91 sources from 23 contributors.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Data contamination in natural language processing is when evaluation data gets mixed into the pre-training corpora of large models, making results unfair. To fix this, a shared task was run to collect information about which datasets and models are contaminated. This helps researchers make fair comparisons by knowing which resources to avoid. A big public database was created to keep track of all the contamination events, and anyone can add more information to it.

Keywords

  • Artificial intelligence
  • Natural language processing