Summary of JPEG-LM: LLMs as Image Generators with Canonical Codec Representations, by Xiaochuang Han et al.
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
by Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov
First submitted to arXiv on: 15 Aug 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to image and video generation: using autoregressive large language models (LLMs) to directly model the compressed file bytes produced by canonical codecs such as JPEG and AVC/H.264. As a proof of concept, the authors pretrain a Llama-architecture model from scratch, with no vision-specific modifications, to generate images and videos. Evaluation shows that this simple approach is more effective than pixel-based modeling and sophisticated vector-quantization baselines, achieving a 31% reduction in FID for image generation. The analysis also highlights JPEG-LM's advantage over vector-quantization models in generating long-tail visual elements. This work paves the way for future research on multimodal language/image/video LLMs.
Low | GrooveSquid.com (original content) | This paper generates images and videos using the compressed files computers already store. It's like taking a picture or video, but instead of keeping it as a grid of pixels, you keep it as a special code that computers understand. Working with that code makes generating images and videos simpler and more effective than other methods. The new approach is tested and shown to outperform previous ones. It's an important step toward computers that can create and understand different kinds of media, such as pictures, videos, and language.
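
To make the core idea concrete, here is a minimal sketch (not the authors' released code) of how an image can be serialized into its canonical JPEG byte stream and exposed to an autoregressive LM as a plain token sequence. The file name, quality setting, and byte-level vocabulary are illustrative assumptions, not the paper's exact preprocessing.

```python
# Minimal sketch of the JPEG-LM idea: an image's canonical JPEG byte stream
# becomes the token sequence for a standard autoregressive language model.
# NOTE: the file name, JPEG quality, and byte-level vocabulary here are
# illustrative assumptions, not the paper's exact preprocessing.
from io import BytesIO

from PIL import Image  # pip install pillow


def jpeg_byte_tokens(image_path: str, quality: int = 25) -> list[int]:
    """Encode an image as a JPEG and return its raw bytes as tokens in [0, 255]."""
    buffer = BytesIO()
    Image.open(image_path).convert("RGB").save(buffer, format="JPEG", quality=quality)
    return list(buffer.getvalue())  # each byte is one token


# A Llama-style transformer with a 256-symbol vocabulary (plus special tokens)
# could then be pretrained on such sequences with the ordinary next-token
# objective, with no vision-specific architecture changes.
tokens = jpeg_byte_tokens("cat.jpg")  # hypothetical input file
print(len(tokens), tokens[:2])  # JPEG streams begin with the SOI marker: 255, 216 (0xFF 0xD8)
```

Because the codec is fixed and canonical, no learned tokenizer (such as a vector-quantized autoencoder) is needed; compression is handled by the JPEG or AVC/H.264 format itself.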
Keywords
» Artificial intelligence » Autoregressive » Image generation » Llama » Multimodal » Quantization