Data Pruning in Generative Diffusion Models

by Rania Briq, Jiangtao Wang, Stefan Kesselheim

First submitted to arXiv on: 19 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computer Vision and Pattern Recognition (cs.CV)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract here.

Medium Difficulty Summary (GrooveSquid.com original content)

This paper investigates the application of data pruning techniques to generative diffusion models, with the goal of improving their accuracy. Contrary to intuition, the authors find that eliminating redundant or noisy data can be beneficial, especially when done strategically. They experiment with several pruning methods, including recent state-of-the-art approaches, and evaluate them on the CelebA-HQ and ImageNet datasets. Surprisingly, a simple clustering method outperforms more complex and computationally demanding techniques. The authors also demonstrate how clustering can be used to balance skewed datasets in an unsupervised manner, allowing for fair sampling of underrepresented populations in the data distribution.

Low Difficulty Summary (GrooveSquid.com original content)

Generative models are designed to estimate the underlying distribution of data, so it’s natural to think that they would benefit from larger datasets. But what if we could trim down these datasets and get rid of some unnecessary information? This paper explores an idea called “data pruning”: identifying the most important parts of a dataset and discarding the rest. The researchers tested different pruning methods with generative models and found that pruning can actually make the models work better. They also discovered that a simple method called clustering is surprisingly effective at separating what’s important from what’s not. This could have big implications for how we use generative models to create new images or videos that are representative of underrepresented groups.

Keywords

» Artificial intelligence  » Clustering  » Pruning  » Unsupervised