Summary of Guided Latent Slot Diffusion for Object-Centric Learning, by Krishnakant Singh et al.


Guided Latent Slot Diffusion for Object-Centric Learning

by Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

First submitted to arXiv on: 25 Jul 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty Written by Summary
High Paper authors High Difficulty Summary
Read the original abstract here
Medium GrooveSquid.com (original content) Medium Difficulty Summary
This paper introduces Guided Latent Slot Diffusion (GLASS), an object-centric model that uses generated captions as a guiding signal to align slots with objects. Slot attention aims to decompose an input image into a set of meaningful object representations (slots) that can support various downstream tasks. However, existing slot attention methods often bind slots to parts of objects rather than to whole objects, particularly on real-world datasets. GLASS addresses this by learning the slot-attention module in the space of generated images, which allows it to repurpose a pre-trained diffusion decoder model as a semantic mask generator based on the generated captions. The model learns an object-level representation suitable for multiple tasks simultaneously and outperforms previous methods. For example, GLASS achieves relative improvements of +35% and +10% over the previous state-of-the-art method on the VOC and COCO datasets, respectively, and sets a new state-of-the-art FID score for conditional image generation among slot-attention-based methods.
Low GrooveSquid.com (original content) Low Difficulty Summary
GLASS is a new way to look at pictures. It tries to break down an image into smaller parts that are like containers for objects. These “slots” can be used for lots of different tasks, like finding things in the picture or making new images. But sometimes these slots get stuck on tiny parts of the object instead of the whole thing. GLASS uses words about what’s in the picture to help it focus on the right objects. This makes it better at all sorts of tasks than previous methods. For example, it can find things in pictures really well and even make new images that look like they were taken by a camera.
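
To make the idea of "slots" in the summaries above more concrete, here is a minimal sketch of the general slot-attention mechanism in plain Python. It is not the GLASS implementation: the learned projections, recurrent update, caption guidance, and diffusion decoder are all omitted, and the function names (`slot_attention`, `masks_from_attention`) are hypothetical. The sketch only illustrates the core step in which slots compete for image features through a softmax taken over the slot axis, and how that assignment can be read as a mask.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def slot_attention(features, num_slots, iterations=3, seed=0):
    """Simplified slot attention (assumption: no learned weights, GRU, or MLP).

    features: list of N feature vectors, each a list of D floats.
    Returns (slots, attn), where attn[i][k] is how strongly feature i is
    assigned to slot k and each attn[i] sums to 1 across slots.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    # Slots start from random noise and are refined iteratively.
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]
    attn = []
    for _ in range(iterations):
        # Softmax over the slot axis: slots compete for each feature.
        attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
        new_slots = []
        for k in range(num_slots):
            weights = [row[k] for row in attn]
            total = sum(weights) + 1e-8
            # Each slot becomes the attention-weighted mean of the features.
            new_slots.append([
                sum(w * f[d] for w, f in zip(weights, features)) / total
                for d in range(dim)
            ])
        slots = new_slots
    return slots, attn


def masks_from_attention(attn):
    """Hard mask: index of the highest-attention slot for each feature."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn]


if __name__ == "__main__":
    # Toy "image": four feature vectors forming two loose groups.
    toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    toy_slots, toy_attn = slot_attention(toy_features, num_slots=2)
    print(masks_from_attention(toy_attn))
```

Because every feature distributes its attention across all slots, nothing in this basic mechanism forces a slot to cover a whole object rather than a part of one. That is the gap the summaries describe GLASS closing with caption-based guidance.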

Keywords

  • Artificial intelligence
  • Attention
  • Decoder
  • Diffusion
  • Image generation
  • Mask