
Summary of Data Contamination Report from the 2024 CONDA Shared Task, by Oscar Sainz et al.


Data Contamination Report from the 2024 CONDA Shared Task

by Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D’Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, Emily Silcock, Kateryna Solonko, David Stap, Mihai Surdeanu, Yu-Min Tseng, Vishaal Udandarao, Zengzhi Wang, Ruijie Xu, Jinglin Yang

First submitted to arXiv on: 31 Jul 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The CONDA 2024 workshop investigates data contamination in natural language processing, where pre-training corpora for large-scale models include evaluation data, compromising results. The shared task collects evidence on contaminated datasets and models to help researchers avoid reporting biased evaluations. A public database tracks contamination events, with contributions welcomed via GitHub. This paper presents the initial compilation of 566 reported entries across 91 sources from 23 contributors.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Data contamination in natural language processing is when evaluation data gets mixed into the pre-training corpora of large models, making results unfair. To fix this, a shared task was run to collect information about which datasets and models are contaminated. This helps researchers make fair comparisons by knowing which resources to avoid. A big public database was created to keep track of all the contamination events, and anyone can add more information to it.

Keywords

  • Artificial intelligence
  • Natural language processing