Summary of Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens, by Lijie Fan et al.


Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

by Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian

First submitted to arXiv on: 17 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates how autoregressive text-to-image models scale, building on transformer architectures in the style of BERT and GPT. The study varies two key design factors: token type (discrete or continuous) and generation order (random or fixed raster order); a minimal sketch of the generation-order axis follows the summaries below. Results show that all model variants scale well in validation loss but follow different trends in evaluation performance, measured by FID, GenEval score, and visual quality. Models on continuous tokens yield better visual quality than models on discrete tokens, and random-order models outperform raster-order models on the GenEval score. Guided by these findings, the authors introduce Fluid, a 10.5B-parameter random-order autoregressive model on continuous tokens, which achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K and an overall score of 0.69 on the GenEval benchmark.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper explores why some computer programs can't generate images as well as they generate words. The researchers looked at how these programs, called autoregressive models, behave when generating images from text descriptions. They found that the type of tokens (the tiny building blocks an image is assembled from) and the order in which those tokens are generated make a big difference. Surprisingly, some models create better-looking images than others just by changing these two things! The authors then built their own model, called Fluid, which is very good at generating images from text. They hope this will inspire other researchers to keep making computer-generated images look even more realistic.
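
To make the generation-order distinction concrete, here is a minimal, hypothetical Python sketch contrasting a raster-order schedule with a random-order schedule over a small grid of image-token positions. The function name generation_schedule, the toy grid size, and the plain-Python setup are illustrative assumptions rather than code from the paper; the actual models predict each token with a large transformer, which is omitted here.

```python
import random

def generation_schedule(height, width, order="raster", seed=0):
    """Return the sequence of (row, col) token positions a toy autoregressive
    image generator would fill in, under either a fixed raster order or a
    random order. Illustrative only -- not the paper's implementation."""
    positions = [(r, c) for r in range(height) for c in range(width)]
    if order == "raster":
        # Fixed order: left to right, top to bottom.
        return positions
    # Random order: a fresh permutation of positions per sample.
    rng = random.Random(seed)
    rng.shuffle(positions)
    return positions

if __name__ == "__main__":
    print("raster:", generation_schedule(2, 2, order="raster"))
    print("random:", generation_schedule(2, 2, order="random"))
```

In the paper's framing, raster order corresponds to GPT-like causal generation and random order to BERT-like bidirectional prediction; the other axis, discrete versus continuous tokens, concerns how each position's value is represented and predicted and is not shown in this sketch.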

Keywords

» Artificial intelligence  » Autoregressive  » BERT  » GPT  » Image generation  » Token  » Transformer  » Zero-shot