Summary of Most Influential Subset Selection: Challenges, Promises, and Beyond, by Yuzheng Hu et al.
Most Influential Subset Selection: Challenges, Promises, and Beyond
by Yuzheng Hu, Pingbang Hu, Han Zhao, Jiaqi W. Ma
First submitted to arXiv on: 25 Sep 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper studies how machine learning model behaviors can be attributed to their training data, focusing on the Most Influential Subset Selection (MISS) problem: identifying the subset of training samples with the greatest collective influence. A comprehensive analysis of prevailing MISS approaches reveals their strengths and weaknesses, showing that influence-based greedy heuristics can fail even in linear regression, both because of errors in influence-function estimates and because collective influence is not additive. An adaptive version of these heuristics, which applies them iteratively, effectively captures interactions among samples and addresses these failures (a toy sketch of the two heuristics follows this table). Experiments on real-world datasets support the theoretical findings and show that the benefit of adaptivity carries over to classification tasks and non-linear neural networks. The paper also cautions against relying on additive metrics such as the Linear Datamodeling Score and highlights the trade-off between performance and computational efficiency. |
| Low | GrooveSquid.com (original content) | Machine learning models are trained on data, but it's hard to figure out which parts of that data make a model behave in a certain way. This paper explores how to identify groups of training samples that have a big impact on the model's behavior. The researchers examined different ways people have tried to solve this problem and found that some approaches don't work well, even for simple models, and also for complex tasks like image classification. They then showed that a new, adaptive approach can help by taking into account how different parts of the data interact with each other. This research has implications for how we use machine learning in real-world applications. |
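To make the contrast between the two heuristics in the medium summary concrete, here is a minimal, hypothetical sketch in NumPy for the linear-regression setting. The static greedy heuristic ranks training points once by their influence on a test prediction and drops the top-k; the adaptive variant refits and re-ranks after each removal, which lets it pick up interactions among samples. The influence formula is the standard first-order influence-function approximation for ordinary least squares; the function names, the exact approximation, and the selection objective are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least-squares fit; returns the parameter vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def influence_scores(X, y, x_test):
    """First-order influence of each training point on the prediction
    x_test @ theta, roughly x_test^T (X^T X)^{-1} x_i * residual_i."""
    theta = fit_ols(X, y)
    hessian_inv = np.linalg.inv(X.T @ X)   # assumes X^T X is invertible
    residuals = y - X @ theta
    return (X @ hessian_inv @ x_test) * residuals

def static_greedy_miss(X, y, x_test, k):
    """Rank all points once by |influence| and drop the top-k."""
    scores = influence_scores(X, y, x_test)
    return list(np.argsort(-np.abs(scores))[:k])

def adaptive_greedy_miss(X, y, x_test, k):
    """Remove one point at a time, refitting and re-scoring after each
    removal so interactions among samples are taken into account."""
    remaining = list(range(len(y)))
    selected = []
    for _ in range(k):
        scores = influence_scores(X[remaining], y[remaining], x_test)
        best = int(np.argmax(np.abs(scores)))
        selected.append(remaining.pop(best))
    return selected

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
x_test = rng.normal(size=5)
print(static_greedy_miss(X, y, x_test, k=5))
print(adaptive_greedy_miss(X, y, x_test, k=5))
```

Note the cost asymmetry the paper's trade-off discussion alludes to: the static heuristic scores every point once, while the adaptive variant refits the model k times, trading extra computation for the ability to account for non-additive collective influence.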
Keywords
» Artificial intelligence » Classification » Image classification » Linear regression » Machine learning