
Summary of Orthus: Autoregressive Interleaved Image-text Generation with Modality-specific Heads, by Siqi Kou et al.


Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

by Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng

First submitted to arxiv on: 28 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces Orthus, an autoregressive transformer that excels at generating images from text prompts, answering questions about visual inputs, and producing interleaved image-text content. Unlike previous unified multimodal models, Orthus handles discrete text tokens and continuous image features simultaneously under a single autoregressive principle. The key innovation is its modality-specific heads: a language modeling (LM) head that predicts text tokens, and a diffusion head that generates continuous image features conditioned on the backbone output. To build Orthus efficiently, the authors replace vector quantization with a soft alternative, add a diffusion head, and tune the modules to reconstruct images. They show that Orthus-base models can be trained quickly (e.g., in 72 A100 GPU hours) and fine-tuned for better interleaved image-text generation. Experimentally, Orthus outperforms competing baselines such as Show-o and Chameleon on standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 with 7B parameters.
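To make the "modality-specific heads" idea concrete, here is a minimal PyTorch sketch: a shared autoregressive backbone feeds two heads, an LM head producing logits over discrete text tokens and a diffusion-style head producing continuous image features conditioned on the backbone output. All module names, layer sizes, and the simple MLP denoiser are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of Orthus-style modality-specific heads.
# Shapes, names, and the MLP diffusion head are assumptions for illustration;
# the paper's real backbone and diffusion head are more elaborate.
import torch
import torch.nn as nn


class ModalitySpecificHeads(nn.Module):
    def __init__(self, hidden_dim=64, vocab_size=100, image_feat_dim=16):
        super().__init__()
        # Shared autoregressive backbone (stand-in: one transformer layer).
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=1)
        # LM head: predicts logits over discrete text tokens.
        self.lm_head = nn.Linear(hidden_dim, vocab_size)
        # Diffusion head: maps noisy continuous image features to a
        # denoised prediction, conditioned on the backbone hidden state.
        self.diffusion_head = nn.Sequential(
            nn.Linear(hidden_dim + image_feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, image_feat_dim),
        )

    def forward(self, embeds, noisy_image_feats):
        h = self.backbone(embeds)                     # (B, T, hidden_dim)
        text_logits = self.lm_head(h)                 # (B, T, vocab_size)
        cond = torch.cat([h, noisy_image_feats], -1)  # condition on backbone
        denoised = self.diffusion_head(cond)          # (B, T, image_feat_dim)
        return text_logits, denoised


model = ModalitySpecificHeads()
x = torch.randn(2, 5, 64)        # token/patch embeddings for a length-5 sequence
noisy = torch.randn(2, 5, 16)    # noisy continuous image features
logits, feats = model(x, noisy)
```

Both heads read the same hidden states, which is what lets one backbone serve text prediction and continuous image generation without quantizing the image features into discrete tokens.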
Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces a new way to generate images from text prompts or answer questions based on pictures. The new model, called Orthus, can also create image-text combinations that look natural. Unlike other models, Orthus handles both the words and the pictures in a special way that keeps all the information intact. To make this work, the authors created a new part of the model that generates images and another part that predicts text. They show that this works well by testing it on standard benchmarks and comparing it to other models. The results are impressive, with Orthus achieving better scores than the competition.

Keywords

» Artificial intelligence  » Autoregressive  » Diffusion  » Quantization  » Text generation  » Transformer