
Summary of A Systematic Review of NeurIPS Dataset Management Practices, by Yiwei Wu et al.


A Systematic Review of NeurIPS Dataset Management Practices

by Yiwei Wu, Leah Ajmani, Shayne Longpre, Hanlin Li

First submitted to arXiv on: 31 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

See the paper’s original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)

The lack of consistent practices for managing large datasets is a significant challenge in machine learning research. A systematic review of datasets published at NeurIPS finds that dataset provenance is often unclear because filtering and curation processes are ambiguous, and that only a few hosting sites offer structured metadata and version control. These findings underscore the need for standardized data infrastructure for publishing and managing datasets.

Low Difficulty Summary (written by GrooveSquid.com, original content)

This paper shows that researchers don’t always follow good practices when sharing big datasets. It looks at four important things: where datasets come from, who gets them, what’s written about ethics, and what licenses are used. The results show that it is often hard to tell where a dataset came from because some filtering steps aren’t clear. Datasets are also hosted on many different websites, but only a few of those sites help keep track of changes with version control. This shows that we need better ways to share and manage big datasets.

Keywords

  • Artificial intelligence
  • Machine learning