Summary of Data Shapley in One Training Run, by Jiachen T. Wang et al.
Data Shapley in One Training Run
by Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia
First submitted to arXiv on: 16 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper introduces In-Run Data Shapley, a framework for attributing the contribution of training data to a specific target model. Existing approaches require retraining models on many different data subsets, which is computationally intensive and limits their application to large-scale models. In-Run Data Shapley addresses these limitations by offering scalable data attribution for a target model of interest, with negligible additional runtime compared to standard model training. This efficiency makes it possible to perform data attribution during foundation model pretraining for the first time. The paper presents several case studies that offer fresh insights into the contribution of pretraining data and discusses their implications for copyright in generative AI and for pretraining data curation. |
| Low | GrooveSquid.com (original content) | Data Shapley is a framework that helps us understand how data contributes to machine learning models. But existing methods are slow and can't be used with big models. They also give the same answer regardless of which trained model we care about, so they can't attribute data for one specific model. This paper fixes these problems by introducing In-Run Data Shapley, a fast and efficient way to attribute data's contribution during a single training run. It can even be used when pretraining foundation models! The paper shows how this works with some real-life examples and discusses what it means for things like AI copyright. |
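To make the idea concrete, here is a minimal sketch of in-run, gradient-based data attribution on a toy linear-regression problem. It is an illustration of the general first-order principle (scoring each training example at every optimizer step by the dot product of its gradient with the validation gradient), not the authors' exact implementation; all variable names and the toy setup are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not from the paper): linear regression
# trained with full-batch gradient descent.
n, d = 8, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# A small held-out validation set to attribute against.
Xv = rng.normal(size=(4, d))
yv = Xv @ w_true

def val_loss(w):
    r = Xv @ w - yv
    return 0.5 * np.mean(r ** 2)

def val_grad(w):
    return Xv.T @ (Xv @ w - yv) / len(yv)

def per_example_grads(w):
    # Gradient of 0.5 * (x_i @ w - y_i)^2 for each training example.
    r = X @ w - y
    return X * r[:, None]            # shape (n, d)

lr, steps = 0.01, 50
w = np.zeros(d)
loss0 = val_loss(w)
contrib = np.zeros(n)                # accumulated first-order scores

for _ in range(steps):
    g = per_example_grads(w)         # (n, d)
    gv = val_grad(w)                 # (d,)
    # First-order score of example i at this step: its share of the
    # update is -(lr / n) * g_i, so its predicted effect on the
    # validation loss is -(lr / n) * <g_i, g_val>. We store the
    # predicted *reduction*, so positive = helpful.
    contrib += (lr / n) * (g @ gv)
    w -= lr * g.mean(axis=0)         # the actual GD update

# contrib[i] now estimates how much example i reduced validation loss
# over the whole run; summed over i it approximates the total drop.
```

Because the scores are accumulated alongside the ordinary gradient computation, this kind of attribution adds little overhead to training, which is the efficiency property the paper highlights.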
Keywords
- Artificial intelligence
- Machine learning
- Pretraining