Summary of JetFormer: An Autoregressive Generative Model of Raw Images and Text, by Michael Tschannen et al.
JetFormer: An Autoregressive Generative Model of Raw Images and Text
by Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
First submitted to arXiv on: 29 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Recent progress in large multimodal models has been driven by removing modeling constraints and unifying architectures across domains, yet most models still rely on separately trained components such as modality-specific encoders and decoders. This paper introduces JetFormer, an autoregressive decoder-only transformer that streamlines joint generative modeling of images and text. Unlike existing models, JetFormer is trained directly to maximize the likelihood of raw data without relying on any separately pretrained components, allowing it to both understand and generate text and images. A normalizing flow produces a soft-token image representation that is trained jointly with the transformer; the same flow serves as an image encoder for perception tasks and, run in reverse at inference, as an image decoder for generation (see the illustrative sketch after this table). Compared with recent VQ-VAE- and VAE-based baselines, which depend on pretrained image autoencoders trained with complex perceptual losses, JetFormer achieves competitive text-to-image generation quality while demonstrating robust image understanding capabilities. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper makes it easier to create pictures and text together using a single computer model called JetFormer. Instead of needing many separate parts like most models do, JetFormer learns everything by itself. It can make new pictures that look realistic, and it can also understand what’s in pictures. The authors say it is the first model of its kind that can draw high-quality pictures while also giving a precise score for how likely each picture is. |
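To make the design described in the medium summary more concrete, below is a minimal, illustrative PyTorch sketch of the JetFormer idea, not the authors' implementation: a single invertible coupling layer stands in for the normalizing flow that turns image patches into soft tokens, a causal transformer models those tokens autoregressively, and the image likelihood combines the transformer's prediction loss with the flow's log-determinant. All class names, layer sizes, and the one-layer flow are assumptions made purely for illustration.

```python
# Toy sketch of the JetFormer idea (not the authors' code): flow -> soft tokens,
# causal transformer models the tokens, likelihood = token NLL - log|det J|.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible coupling layer standing in for the paper's flow (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, dim), nn.GELU(),
                                 nn.Linear(dim, dim))  # outputs log-scale and shift

    def forward(self, x):                       # x: (B, T, dim) raw patch features
        a, b = x.chunk(2, dim=-1)
        log_s, t = self.net(a).chunk(2, dim=-1)
        z = torch.cat([a, b * log_s.exp() + t], dim=-1)
        logdet = log_s.sum(dim=(1, 2))          # per-example log|det J|
        return z, logdet

    def inverse(self, z):                       # used as the image decoder at inference
        a, c = z.chunk(2, dim=-1)
        log_s, t = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, (c - t) * (-log_s).exp()], dim=-1)

class ToyJetFormer(nn.Module):
    def __init__(self, dim=64, n_layers=2, n_heads=4):
        super().__init__()
        self.flow = AffineCoupling(dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, dim)         # predicts the next soft token (Gaussian mean)

    def image_nll(self, patches):               # patches: (B, T, dim) pixel patches
        z, logdet = self.flow(patches)          # soft tokens + change-of-variables term
        T = z.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(z, mask=mask)         # causal (autoregressive) attention
        # Unit-variance Gaussian next-token prediction; the paper uses a richer head.
        # (For brevity, the first soft token is unconditioned; the paper conditions on text.)
        pred = self.head(h[:, :-1])
        gauss_nll = 0.5 * ((z[:, 1:] - pred) ** 2).sum(dim=(1, 2))
        return gauss_nll - logdet               # exact NLL up to an additive constant

model = ToyJetFormer()
patches = torch.randn(2, 16, 64)                # 2 fake images as 16 patch vectors each
print(model.image_nll(patches))                 # per-image negative log-likelihood
```

At inference time, text-to-image generation would amount to sampling soft tokens from the transformer and mapping them back to pixels with `flow.inverse`, which is what lets a single flow act as both the image encoder and the image decoder.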
Keywords
» Artificial intelligence » Autoregressive » Decoder » Encoder » Image generation » Inference » Likelihood » Transformer