Summary of Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond, by Kyriakos Axiotis et al.
Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond
by Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder
First submitted to arXiv on: 27 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Data Structures and Algorithms (cs.DS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes a new approach to the data selection problem: training machine learning models efficiently by selecting a small, representative subset of the data. The method combines k-means clustering with sensitivity sampling, applied to an embedding representation of the data on which the model loss is Hölder continuous. This makes it possible to pick out “typical” elements whose weighted loss accurately approximates the average loss of the entire dataset, with provable guarantees on the accuracy and robustness of the selected subset (a hedged code sketch of this pipeline appears below the table). |
Low | GrooveSquid.com (original content) | Imagine you’re trying to train a machine learning model, but you don’t have all the data. The data selection problem is about finding a small piece of that data that can teach your model most of what it needs to know. This paper introduces a new way to solve this problem using a combination of clustering and sampling techniques. By looking at how well each piece of data fits into different groups, we can pick out the most important examples and use them to train the model. This approach is useful when you don’t have access to all the data or when you want to speed up training. |
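The medium-difficulty summary describes a two-step pipeline: cluster the data embeddings with k-means, then sample points with probability proportional to a sensitivity score and reweight them so the sampled loss estimates the full loss. The sketch below illustrates that idea in Python. It is a minimal sketch, not the paper’s exact algorithm: the function name `sensitivity_sample` is hypothetical, and the scoring formula is the standard k-means sensitivity upper bound, whereas the paper’s guarantees additionally rely on Hölder continuity of the loss in the embedding space.

```python
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(embeddings, k, m, rng=None):
    """Select a weighted subset of size m via clustering-based
    sensitivity sampling (illustrative sketch, not the paper's method).

    embeddings : (n, d) array of embedding vectors for the data points
    k          : number of k-means clusters
    m          : number of points to sample
    Returns (indices, weights): sampled indices and importance weights.
    """
    rng = np.random.default_rng(rng)
    km = KMeans(n_clusters=k, n_init=10).fit(embeddings)
    labels = km.labels_

    # Squared distance of each point to its assigned cluster center.
    costs = np.sum((embeddings - km.cluster_centers_[labels]) ** 2, axis=1)
    total_cost = max(costs.sum(), 1e-12)  # guard against a zero-cost clustering
    cluster_sizes = np.bincount(labels, minlength=k)

    # Standard k-means sensitivity upper bound: a point is "important"
    # if it is far from its center or sits in a small cluster.
    sens = costs / total_cost + 1.0 / cluster_sizes[labels]
    probs = sens / sens.sum()

    # Sample m points i.i.d. and reweight so that the weighted loss
    # on the sample is an unbiased estimate of the full-dataset loss.
    idx = rng.choice(len(embeddings), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

Sampling with replacement and weighting each draw by 1/(m·p) makes the weighted sample loss an unbiased estimator of the average loss over the whole dataset, which is the property the clustering-based scores are designed to preserve.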
Keywords
* Artificial intelligence
* Clustering
* Embedding
* k-means
* Machine learning