Summary of Data Shapley in One Training Run, by Jiachen T. Wang et al.
Data Shapley in One Training Run
by Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia
First submitted to arXiv on: 16 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper introduces In-Run Data Shapley, a framework for attributing the contribution of training data to a specific target model. Existing approaches require retraining models on many different data subsets, which is computationally intensive and limits their application to large-scale models. In-Run Data Shapley addresses these limitations by offering scalable data attribution for a target model of interest, with negligible additional runtime compared to standard model training. This efficiency makes it possible to perform data attribution during foundation model pretraining for the first time. The paper presents several case studies that offer fresh insights into the contribution of pretraining data and discusses their implications for copyright in generative AI and for pretraining data curation. |
| Low | GrooveSquid.com (original content) | Data Shapley is a framework that helps us understand how data contributes to machine learning models. But existing methods are slow and can't be used with big models. They also give the same answer regardless of which trained model we care about, so they can't attribute data for one specific model. This paper fixes these problems by introducing In-Run Data Shapley, a fast and efficient way to attribute data's contribution during a single training run. It can even be used when pretraining foundation models! The paper shows how this works with some real-life examples and discusses what it means for things like AI copyright. |
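To make the idea concrete, here is a minimal sketch of in-run, gradient-based data attribution on a toy linear-regression problem. It is an illustration of the general first-order principle (scoring each training example at every optimizer step by the dot product of its gradient with the validation gradient), not the authors' exact implementation; all variable names and the toy setup are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not from the paper): linear regression
# trained with full-batch gradient descent.
n, d = 8, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# A small held-out validation set to attribute against.
Xv = rng.normal(size=(4, d))
yv = Xv @ w_true

def val_loss(w):
    r = Xv @ w - yv
    return 0.5 * np.mean(r ** 2)

def val_grad(w):
    return Xv.T @ (Xv @ w - yv) / len(yv)

def per_example_grads(w):
    # Gradient of 0.5 * (x_i @ w - y_i)^2 for each training example.
    r = X @ w - y
    return X * r[:, None]            # shape (n, d)

lr, steps = 0.01, 50
w = np.zeros(d)
loss0 = val_loss(w)
contrib = np.zeros(n)                # accumulated first-order scores

for _ in range(steps):
    g = per_example_grads(w)         # (n, d)
    gv = val_grad(w)                 # (d,)
    # First-order score of example i at this step: its share of the
    # update is -(lr / n) * g_i, so its predicted effect on the
    # validation loss is -(lr / n) * <g_i, g_val>. We store the
    # predicted *reduction*, so positive = helpful.
    contrib += (lr / n) * (g @ gv)
    w -= lr * g.mean(axis=0)         # the actual GD update

# contrib[i] now estimates how much example i reduced validation loss
# over the whole run; summed over i it approximates the total drop.
```

Because the scores are accumulated alongside the ordinary gradient computation, this kind of attribution adds little overhead to training, which is the efficiency property the paper highlights.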
Keywords
- Artificial intelligence
- Machine learning
- Pretraining