Summary of Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching, by Enshu Liu et al.
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
by Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
First submitted to arXiv on: 22 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper proposes Distilled Decoding (DD), a method for accelerating generation from autoregressive (AR) models. AR models achieve state-of-the-art quality in text and image generation but are slow because they produce output token by token. The authors ask whether a pre-trained AR model can be adapted to generate in one or two steps, which would greatly ease development and deployment. They observe that prior attempts to speed up AR generation by emitting multiple tokens at once cannot faithfully capture the output distribution, because the tokens are conditionally dependent on one another. DD instead uses flow matching to define a deterministic mapping from a Gaussian distribution to the output distribution of the pre-trained AR model, and then trains a network to distill this mapping, enabling few-step generation without access to the original AR model’s training data (a simplified sketch of this idea follows the table). Evaluated on state-of-the-art image AR models on ImageNet-256, DD turns the 10-step VAR generator into a one-step generator with an acceptable FID increase from 4.19 to 9.96, and reduces LlamaGen from 256 generation steps to 1, a 217.8x speed-up, with a comparable FID increase from 4.11 to 11.35. |
Low | GrooveSquid.com (original content) | This research paper is about making computers generate images and text faster. Right now, these computers can create really good pictures and writing, but it takes them a long time because they have to do things one step at a time. The researchers asked if there’s a way to make these computers work faster by doing just one or two steps instead of many. They looked at what other people had tried before and found that those methods weren’t very good because they couldn’t capture all the details in the images. To solve this problem, the researchers came up with a new idea called Distilled Decoding (DD). It’s like using a special map to help the computer figure out what it should generate next. They tested DD on some really hard image generation tasks and found that it worked much faster than usual without sacrificing too much quality. |
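To make the distillation idea in the medium summary concrete, below is a minimal, heavily simplified sketch in PyTorch. It assumes a continuous embedding space, a toy frozen linear map standing in for the teacher's flow-matching noise-to-token mapping, and a plain MSE distillation loss; none of these names or choices come from the paper, which works with discrete image tokens and a more involved training procedure.

```python
# Illustrative sketch only: a toy stand-in for a Distilled Decoding-style training loop.
# The real method distills the flow-matching trajectory of a pre-trained image AR
# model (e.g., VAR or LlamaGen); here a frozen linear map plays the role of that
# deterministic noise-to-token mapping so the snippet runs end to end.

import torch
import torch.nn as nn
import torch.nn.functional as F

BATCH, SEQ_LEN, DIM = 8, 256, 64  # toy sizes; real models are far larger

# Stand-in for the teacher's deterministic mapping from Gaussian noise to tokens.
# In the paper this comes from running the pre-trained AR model with flow matching
# (slow, many steps); here it is just a frozen random projection.
teacher_map = nn.Linear(DIM, DIM)
for p in teacher_map.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def teacher_noise_to_tokens(noise: torch.Tensor) -> torch.Tensor:
    """Placeholder for the flow-matching noise -> token-sequence mapping."""
    return teacher_map(noise)

# Student: maps the full noise sequence to token embeddings in ONE forward pass.
student = nn.Sequential(
    nn.Linear(DIM, 4 * DIM),
    nn.GELU(),
    nn.Linear(4 * DIM, DIM),
)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

for step in range(100):
    # 1) Sample Gaussian noise for an entire token sequence.
    noise = torch.randn(BATCH, SEQ_LEN, DIM)
    # 2) Query the (frozen) teacher mapping for the corresponding tokens.
    target = teacher_noise_to_tokens(noise)
    # 3) Train the student to reproduce that mapping in a single step.
    loss = F.mse_loss(student(noise), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After distillation, one-step generation is a single forward pass over fresh noise.
samples = student(torch.randn(BATCH, SEQ_LEN, DIM))
```

The point of the sketch is only the training signal: the student never sees the original training data, only (noise, teacher output) pairs produced by the pre-trained model's deterministic mapping, which is what allows the distilled network to sample in one step.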
Keywords
» Artificial intelligence » Autoregressive » Image generation » Token