Summary of JetFormer: An Autoregressive Generative Model of Raw Images and Text, by Michael Tschannen et al.
JetFormer: An Autoregressive Generative Model of Raw Images and Text
by Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
First submitted to arXiv on: 29 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Recent progress in large multimodal models has been driven by removing modeling constraints and unifying architectures across domains, yet most models still rely on separately trained components such as modality-specific encoders and decoders. This paper introduces JetFormer, an autoregressive decoder-only transformer that streamlines joint generative modeling of images and text. Unlike existing models, JetFormer is trained directly to maximize the likelihood of raw data without relying on any separately pretrained components, allowing it to both understand and generate text and images. A normalizing flow produces a soft-token image representation that is trained jointly with the transformer; the same flow serves as an image encoder for perception tasks and, run in reverse at inference, as an image decoder for generation (see the illustrative sketch after this table). Compared with recent VQ-VAE- and VAE-based baselines, which depend on pretrained image autoencoders trained with complex perceptual losses, JetFormer achieves competitive text-to-image generation quality while demonstrating robust image understanding capabilities. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper makes it easier to create pictures and text together using a single computer model called JetFormer. Instead of needing many separate parts like most models do, JetFormer learns everything by itself. It can make new pictures that look realistic, and it can also understand what’s in pictures. The authors say it is the first model of its kind that can draw high-quality pictures while also giving a precise score for how likely each picture is. |
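To make the design described in the medium summary more concrete, below is a minimal, illustrative PyTorch sketch of the JetFormer idea, not the authors' implementation: a single invertible coupling layer stands in for the normalizing flow that turns image patches into soft tokens, a causal transformer models those tokens autoregressively, and the image likelihood combines the transformer's prediction loss with the flow's log-determinant. All class names, layer sizes, and the one-layer flow are assumptions made purely for illustration.

```python
# Toy sketch of the JetFormer idea (not the authors' code): flow -> soft tokens,
# causal transformer models the tokens, likelihood = token NLL - log|det J|.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible coupling layer standing in for the paper's flow (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, dim), nn.GELU(),
                                 nn.Linear(dim, dim))  # outputs log-scale and shift

    def forward(self, x):                       # x: (B, T, dim) raw patch features
        a, b = x.chunk(2, dim=-1)
        log_s, t = self.net(a).chunk(2, dim=-1)
        z = torch.cat([a, b * log_s.exp() + t], dim=-1)
        logdet = log_s.sum(dim=(1, 2))          # per-example log|det J|
        return z, logdet

    def inverse(self, z):                       # used as the image decoder at inference
        a, c = z.chunk(2, dim=-1)
        log_s, t = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, (c - t) * (-log_s).exp()], dim=-1)

class ToyJetFormer(nn.Module):
    def __init__(self, dim=64, n_layers=2, n_heads=4):
        super().__init__()
        self.flow = AffineCoupling(dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, dim)         # predicts the next soft token (Gaussian mean)

    def image_nll(self, patches):               # patches: (B, T, dim) pixel patches
        z, logdet = self.flow(patches)          # soft tokens + change-of-variables term
        T = z.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(z, mask=mask)         # causal (autoregressive) attention
        # Unit-variance Gaussian next-token prediction; the paper uses a richer head.
        # (For brevity, the first soft token is unconditioned; the paper conditions on text.)
        pred = self.head(h[:, :-1])
        gauss_nll = 0.5 * ((z[:, 1:] - pred) ** 2).sum(dim=(1, 2))
        return gauss_nll - logdet               # exact NLL up to an additive constant

model = ToyJetFormer()
patches = torch.randn(2, 16, 64)                # 2 fake images as 16 patch vectors each
print(model.image_nll(patches))                 # per-image negative log-likelihood
```

At inference time, text-to-image generation would amount to sampling soft tokens from the transformer and mapping them back to pixels with `flow.inverse`, which is what lets a single flow act as both the image encoder and the image decoder.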
Keywords
» Artificial intelligence » Autoregressive » Decoder » Encoder » Image generation » Inference » Likelihood » Transformer