
Summary of Orthus: Autoregressive Interleaved Image-text Generation with Modality-specific Heads, by Siqi Kou et al.


Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

by Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng

First submitted to arxiv on: 28 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces Orthus, an autoregressive transformer that excels at generating images from text prompts, answering questions about visual inputs, and producing interleaved image-text content. Unlike previous unified multimodal models, Orthus handles discrete text tokens and continuous image features simultaneously under a single autoregressive principle. The key innovation is its modality-specific heads: a language modeling (LM) head that predicts text tokens, and a diffusion head that generates continuous image features conditioned on the backbone output. To build Orthus efficiently, the authors replace vector quantization with a soft alternative, add a diffusion head, and tune the modules to reconstruct images. They show that Orthus-base models can be trained quickly (e.g., in 72 A100 GPU hours) and fine-tuned for better interleaved image-text generation. Experimentally, Orthus outperforms competing baselines such as Show-o and Chameleon on standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 with 7B parameters.
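To make the "modality-specific heads" idea concrete, here is a minimal PyTorch sketch: a shared autoregressive backbone feeds two heads, an LM head producing logits over discrete text tokens and a diffusion-style head producing continuous image features conditioned on the backbone output. All module names, layer sizes, and the simple MLP denoiser are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of Orthus-style modality-specific heads.
# Shapes, names, and the MLP diffusion head are assumptions for illustration;
# the paper's real backbone and diffusion head are more elaborate.
import torch
import torch.nn as nn


class ModalitySpecificHeads(nn.Module):
    def __init__(self, hidden_dim=64, vocab_size=100, image_feat_dim=16):
        super().__init__()
        # Shared autoregressive backbone (stand-in: one transformer layer).
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=1)
        # LM head: predicts logits over discrete text tokens.
        self.lm_head = nn.Linear(hidden_dim, vocab_size)
        # Diffusion head: maps noisy continuous image features to a
        # denoised prediction, conditioned on the backbone hidden state.
        self.diffusion_head = nn.Sequential(
            nn.Linear(hidden_dim + image_feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, image_feat_dim),
        )

    def forward(self, embeds, noisy_image_feats):
        h = self.backbone(embeds)                     # (B, T, hidden_dim)
        text_logits = self.lm_head(h)                 # (B, T, vocab_size)
        cond = torch.cat([h, noisy_image_feats], -1)  # condition on backbone
        denoised = self.diffusion_head(cond)          # (B, T, image_feat_dim)
        return text_logits, denoised


model = ModalitySpecificHeads()
x = torch.randn(2, 5, 64)        # token/patch embeddings for a length-5 sequence
noisy = torch.randn(2, 5, 16)    # noisy continuous image features
logits, feats = model(x, noisy)
```

Both heads read the same hidden states, which is what lets one backbone serve text prediction and continuous image generation without quantizing the image features into discrete tokens.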
Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces a new way to generate images from text prompts or answer questions based on pictures. The new model, called Orthus, can also create image-text combinations that look natural. Unlike other models, Orthus handles both the words and the pictures in a special way that keeps all the information intact. To make this work, the authors created a new part of the model that generates images and another part that predicts text. They show that this works well by testing it on standard benchmarks and comparing it to other models. The results are impressive, with Orthus achieving better scores than the competition.

Keywords

» Artificial intelligence  » Autoregressive  » Diffusion  » Quantization  » Text generation  » Transformer