Summary of A Systematic Review of NeurIPS Dataset Management Practices, by Yiwei Wu et al.
A Systematic Review of NeurIPS Dataset Management Practices
by Yiwei Wu, Leah Ajmani, Shayne Longpre, Hanlin Li
First submitted to arXiv on: 31 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Inconsistent practices for managing large datasets are a significant challenge in machine learning research. A systematic review of datasets published at NeurIPS finds that dataset provenance is often unclear because filtering and curation processes are described ambiguously, and that only a few hosting sites offer structured metadata and version control. These findings underscore the need for standardized data infrastructure for publishing and managing datasets. |
Low | GrooveSquid.com (original content) | This paper shows that researchers don’t always follow good practices when sharing big datasets. It looks at four things: where datasets come from, how they are shared, what is written about ethics, and what licenses are used. The results show that it is hard to tell where datasets came from because some filtering steps are not clearly described. Datasets are also hosted on many different websites, but only a few of those sites track changes with version control. Together, these results point to a need for better ways to share and manage big datasets. |
Keywords
* Artificial intelligence
* Machine learning