Summary of Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens, by Lijie Fan et al.
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
by Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian
First submitted to arXiv on: 17 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper investigates the scaling behavior of autoregressive text-to-image models built on transformer architectures like BERT and GPT. The study focuses on two key factors: token type (discrete or continuous) and generation order (random or raster). Results show that all models scale well in validation loss but exhibit different trends in evaluation performance, measured by FID, GenEval score, and visual quality. Continuous tokens yield better visual quality, while random-order models outperform raster-order models on GenEval. Guided by these findings, the authors introduce Fluid, a 10.5B-parameter autoregressive model on continuous tokens that achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K and an overall score of 0.69 on the GenEval benchmark. |
Low | GrooveSquid.com (original content) | This paper explores why some computer programs can't generate images as well as they generate words. The researchers looked at how these programs, called autoregressive models, behave when generating images from text descriptions. They found that the type of tokens (the tiny building blocks an image is assembled from) and the order in which they're generated make a big difference. Surprisingly, some models create better-looking images than others just by changing these two things! The authors then built their own model, called Fluid, which is very good at generating images from text. They hope this will inspire other researchers to keep making computer-generated images look even more realistic. |
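The two design factors the paper varies can be sketched in a few lines of toy code. Everything below is illustrative only (the function names, shapes, and the nearest-neighbor quantizer are assumptions for exposition, not the paper's actual implementation): discrete tokens quantize each latent vector to a codebook index, continuous tokens keep the raw vectors, and generation order is simply a permutation of token positions.

```python
import numpy as np

def tokenize(latents, codebook=None):
    """Discrete tokens: nearest-codebook indices (lossy quantization).
    Continuous tokens: the raw latent vectors, unquantized.
    Illustrative sketch only -- not the paper's implementation."""
    if codebook is None:
        return latents  # continuous tokens: keep the vectors as-is
    # Distance from every latent to every codebook entry, via broadcasting.
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # discrete token ids

def generation_order(num_tokens, order="raster", seed=0):
    """Raster order: fixed left-to-right, top-to-bottom sequence.
    Random order: a fresh permutation of the token positions."""
    positions = np.arange(num_tokens)
    if order == "random":
        positions = np.random.default_rng(seed).permutation(positions)
    return positions

# 256 tokens (a 16x16 grid) with 8-dimensional latents, purely for demo.
latents = np.random.default_rng(0).normal(size=(256, 8))
codebook = np.random.default_rng(1).normal(size=(1024, 8))

discrete = tokenize(latents, codebook)      # shape (256,), integer ids
continuous = tokenize(latents)              # shape (256, 8), float vectors
raster = generation_order(256, "raster")    # 0, 1, 2, ...
shuffled = generation_order(256, "random")  # permuted positions
```

The paper's finding, in these terms, is that the continuous variant (no quantization loss) gives better visual quality, while the random-order variant scores higher on GenEval.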
Keywords
» Artificial intelligence » Autoregressive » Bert » Gpt » Image generation » Token » Transformer » Zero shot