
Summary of Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond, by Minghao Liu et al.


Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond

by Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu

First submitted to arXiv on: 21 Aug 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
This version is the paper’s original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed Automatic Dataset Construction (ADC) methodology uses Large Language Models (LLMs) to automate dataset creation at negligible cost and with high efficiency. It reduces the need for manual annotation and speeds up data generation, which makes high-quality datasets quicker to build. ADC is designed for image classification tasks but can be applied to other domains as well. Datasets built this way run into real-world problems such as label errors (label noise) and imbalanced data distributions (label bias), which are addressed using existing methods for robust learning under noisy and biased data. To support further research, three benchmark datasets are designed, focused on label noise detection, label noise learning, and class-imbalanced learning. Popular methods are evaluated on these datasets, and the results highlight the importance of addressing label noise in machine learning.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Automatic Dataset Construction (ADC) makes it easier to create high-quality training data by using computers instead of humans. This helps solve a big problem: for some tasks there isn’t enough good training data. ADC uses special computer models called Large Language Models (LLMs) to collect relevant samples from search engines, which reduces the need for manual annotation and makes data generation faster. However, the collected data can contain labeling errors and an uneven number of examples per class. To overcome these issues, existing methods are used to make training more reliable. Additionally, three new datasets are created that focus on detecting label noise and on learning from noisy and imbalanced data. These datasets are important because there aren’t many similar ones available.
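To make the two data problems named in the summaries more concrete, here is a minimal Python sketch. It is not the paper's method or code: the function names, the nearest-neighbour voting rule, and the toy data are all assumptions chosen for illustration. The first function computes inverse-frequency class weights, a common way to compensate for class imbalance, and the second flags samples whose label disagrees with most of their nearest neighbours, a simple stand-in for label noise detection.

```python
from collections import Counter
import math


def inverse_frequency_weights(labels):
    """Return a weight N / (K * count) per class, so rarer classes weigh more.

    N is the number of samples and K the number of distinct classes.
    (Hypothetical helper, not taken from the paper.)
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def flag_suspect_labels(features, labels, k=3):
    """Return indices whose label differs from the majority label of
    their k nearest neighbours (Euclidean distance).

    (Hypothetical helper: a simple heuristic, not the paper's detector.)
    """
    suspects = []
    for i, (x, y) in enumerate(zip(features, labels)):
        # Distances from sample i to every other sample, nearest first.
        dists = sorted(
            (math.dist(x, z), j) for j, z in enumerate(features) if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        majority, _ = votes.most_common(1)[0]
        if majority != y:
            suspects.append(i)
    return suspects


# Toy data (made up): two clusters, with one point in the "cat"
# cluster carrying a "dog" label.
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog"]

print(inverse_frequency_weights(labels))
print(flag_suspect_labels(features, labels))  # flags index 3
```

In practice the features would come from a trained model's embeddings rather than hand-written coordinates, and the weights would typically be passed to a loss function during training.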

Keywords

» Artificial intelligence  » Image classification  » Machine learning