Summary of Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens, by Lijie Fan et al.
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
by Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian
First submitted to arXiv on: 17 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary | 
|---|---|---|
| High | Paper authors | High Difficulty Summary: The paper's original abstract, available on its arXiv page. | 
| Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: This paper investigates why scaling up autoregressive models has helped text-to-image generation less than it has helped language modeling. The study focuses on two design factors: token type (discrete or continuous) and generation order (random or fixed raster order), using BERT-like and GPT-like transformer architectures. Results show that all models scale well in terms of validation loss, but their evaluation performance, measured by FID, GenEval score, and visual quality, follows different trends. Models using continuous tokens achieve significantly better visual quality than those using discrete tokens, and random-order models achieve notably better GenEval scores than raster-order models. Inspired by these findings, the authors train Fluid, a random-order autoregressive model on continuous tokens. The 10.5B-parameter Fluid model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K and an overall score of 0.69 on the GenEval benchmark. | 
| Low | GrooveSquid.com (original content) | Low Difficulty Summary: This paper explores why making computer programs bigger has not improved image generation as much as it has improved text generation. The researchers looked at how these programs, called autoregressive models, work when generating images from text descriptions. They found that two things make a big difference: the type of tokens (the small pieces an image is broken into) and the order in which those tokens are generated. Changing just these two things lets some models create better-looking images, or images that match the text description more closely, than others. The authors then created their own model called Fluid, which sets a new record on a standard image-quality benchmark. They hope their results will encourage other researchers to keep closing the gap between image models and language models. | 
Keywords
* Artificial intelligence * Autoregressive * BERT * GPT * Image generation * Token * Transformer * Zero-shot




