Summary of Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models, by Cristina N. Vasconcelos et al.
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
by Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang
First submitted to arxiv on: 27 May 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract (available on arXiv). |
| Medium | GrooveSquid.com (original content) | This paper tackles the problem of learning effective pixel-based image diffusion models at scale, introducing a simple greedy growing method for training large-scale, high-resolution models without cascaded super-resolution components. The key insight is to pre-train the core components responsible for text-to-image alignment and high-resolution rendering. The authors first demonstrate the benefits of scaling a shallow UNet architecture with no downsampling or upsampling, which improves alignment, object structure, and composition. Building on this core model, they propose a greedy algorithm that grows the architecture into a high-resolution end-to-end model while preserving pre-trained representations, stabilizing training, and reducing the need for large high-resolution datasets (a sketch of the idea follows this table). Using only public datasets, they train non-cascaded models of up to 8 billion parameters with no additional regularization schemes, yielding a single-stage model that generates high-resolution images without super-resolution cascades. |
| Low | GrooveSquid.com (original content) | This paper helps us learn better image diffusion models, which are important for things like image generation and editing. The authors came up with a simple way to make these models bigger and more powerful without needing lots of extra data or complicated training methods. They did this by starting with a basic model that works well for smaller images and then adding more parts so it works for larger images. This let them train models with many billions of parameters, which is really big! The authors tested their method on public datasets and found that human evaluators preferred the resulting images over those from previous methods. |
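To make the greedy growing idea from the medium summary concrete, here is a minimal, hypothetical PyTorch sketch. The class names (`ShallowCore`, `GrownModel`), layer sizes, and wiring are illustrative assumptions rather than the authors' actual architecture, and text conditioning plus diffusion-specific details are omitted: `ShallowCore` stands in for the pre-trained shallow UNet with no downsampling or upsampling, and `GrownModel` shows one growing step that wraps it with freshly initialized downsampling and upsampling stages while reusing the core's pre-trained weights.

```python
# Hypothetical sketch of the greedy growing idea; names, layer sizes,
# and wiring are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class ShallowCore(nn.Module):
    """Stand-in for the pre-trained shallow UNet: residual conv blocks
    with no downsampling or upsampling, at a fixed low resolution."""

    def __init__(self, in_channels=3, channels=64, depth=4):
        super().__init__()
        self.in_proj = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.GroupNorm(8, channels),
                nn.SiLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
            for _ in range(depth)
        )
        self.out_proj = nn.Conv2d(channels, in_channels, 3, padding=1)

    def forward(self, x):
        h = self.in_proj(x)
        for block in self.blocks:
            h = h + block(h)  # residual connections keep training stable
        return self.out_proj(h)


class GrownModel(nn.Module):
    """One growing step: wrap the pre-trained core with a freshly
    initialized 2x-downsampling encoder and 2x-upsampling decoder so
    the model accepts higher-resolution inputs."""

    def __init__(self, core, in_channels=3, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(  # new weights, trained from scratch
            nn.Conv2d(in_channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, in_channels, 3, stride=2, padding=1),
        )
        self.core = core  # pre-trained weights, reused as-is
        self.decoder = nn.Sequential(  # new weights, trained from scratch
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, in_channels, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.core(self.encoder(x)))


core = ShallowCore()
# ... pre-train `core` at low resolution for text-to-image alignment ...
model = GrownModel(core)
# ... fine-tune `model` end-to-end at the higher resolution; repeat the
# wrapping step to grow to higher resolutions ...
x = torch.randn(1, 3, 128, 128)
print(model(x).shape)  # torch.Size([1, 3, 128, 128])
```

Repeating the wrapping step stage by stage is what makes the procedure "greedy": each stage starts from the previous stage's learned weights instead of training the full high-resolution model from scratch, which is what lets the method avoid super-resolution cascades.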
Keywords
» Artificial intelligence » Alignment » Diffusion » Image generation » Regularization » Super resolution » Unet