Summary of Automatic Data Curation For Self-supervised Learning: a Clustering-based Approach, by Huy V. Vo et al.
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
by Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
First submitted to arXiv on: 24 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary The paper's original abstract, available on its arXiv page |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The paper proposes a method for automatically curating large, diverse, and balanced datasets for self-supervised pre-training. The method clusters a large data pool so that the clusters are spread uniformly across the concepts present in the data, then applies hierarchical sampling to draw a balanced set of samples from these clusters. Experiments on three domains (web-based images, satellite images, and text) show that features trained on the automatically curated datasets outperform those trained on uncurated data and are comparable to those trained on manually curated data. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary With self-supervised learning, machine learning models can learn from data without being explicitly told what is right or wrong. For this to work well, you need big, diverse, and well-balanced datasets, and building them by hand is time-consuming and expensive. This paper suggests a way to build such datasets automatically. It uses a method called clustering to group similar data points together, then picks a balanced set of samples from the groups. The results show that models trained on data selected this way do better than models trained on unfiltered data, and about as well as models trained on hand-curated data. |
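
The cluster-then-sample idea described in the summaries can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single level of plain k-means and a flat per-cluster cap, whereas the summaries describe a hierarchical procedure. The function names (`kmeans`, `balanced_sample`), the 2-D toy points, and all parameter values are hypothetical choices made for illustration only.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def balanced_sample(labels, per_cluster, seed=0):
    """Pick at most `per_cluster` point indices from every cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    chosen = []
    for lab in sorted(clusters):
        members = clusters[lab]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical imbalanced pool: one dominant concept and two rare ones.
    pool = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(240)]
            + [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(40)]
            + [(rng.gauss(-8, 1), rng.gauss(8, 1)) for _ in range(20)])
    labels = kmeans(pool, k=3)
    curated = balanced_sample(labels, per_cluster=20)
    print(len(pool), "->", len(curated), "samples after balancing")
```

Capping the number of samples per cluster is what reduces the dominance of over-represented concepts: a large cluster contributes no more points than a small one. A hierarchical variant would repeat the same two steps on the cluster centroids, which this sketch leaves out.
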
Keywords
» Artificial intelligence » Clustering » Machine learning » Self supervised