


JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

by Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov

First submitted to arXiv on 15 Aug 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper proposes a novel approach to image and video generation: using autoregressive large language models (LLMs) to directly model the compressed file bytes produced by canonical codecs, JPEG for images and AVC/H.264 for videos. The authors pretrain a Llama architecture from scratch, with no vision-specific modifications, and use it to generate images and videos as a proof of concept. Their evaluation shows that this simple approach is more effective than pixel-based modeling and sophisticated vector-quantization baselines, yielding a 31% reduction in FID for image generation relative to the vector-quantization baseline. The analysis also highlights JPEG-LM's advantage over vector-quantization models in generating long-tail visual elements. This work paves the way for future research on multi-modal language/image/video LLMs.
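
To make the core idea concrete, here is a minimal sketch (in Python, not from the paper) of what "directly modeling compressed file bytes" means: a JPEG file is serialized into a flat sequence of byte-level tokens that a causal LM can be trained on, and a generated token sequence is written straight back out as a decodable image file. The file names and the one-token-per-byte vocabulary are illustrative assumptions, not the paper's exact tokenization.

```python
from pathlib import Path
from typing import List

VOCAB_SIZE = 256  # one token id per possible byte value (assumed scheme)


def jpeg_to_tokens(path: str) -> List[int]:
    """Read a JPEG file and return its raw bytes as a flat token sequence."""
    return list(Path(path).read_bytes())  # each byte 0-255 becomes a token id


def tokens_to_jpeg(tokens: List[int], path: str) -> None:
    """Write a generated token sequence back out as a JPEG file."""
    Path(path).write_bytes(bytes(tokens))


if __name__ == "__main__":
    tokens = jpeg_to_tokens("example.jpg")  # hypothetical input image
    print(f"sequence length: {len(tokens)} tokens")
    # A causal LM (e.g., a Llama-style transformer) is pretrained to predict
    # tokens[t] from tokens[:t]; sampling from it produces a byte stream that
    # is itself a decodable JPEG file, with no pixel decoder or VQ codebook.
    tokens_to_jpeg(tokens, "roundtrip.jpg")
```

Because a codec like JPEG compresses an image into relatively few bytes, the sequence the language model must handle is far shorter than the raw pixel grid, which is what makes this canonical-codec representation practical for a standard LLM.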
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper generates images and videos from the compressed files computers already use to store them. Instead of representing a picture as a grid of pixels, it is stored as a special kind of code (such as a JPEG file) that computers understand, and the model learns to write that code directly. This makes generating images and videos simpler and more effective than other methods, and the new approach is tested and shown to beat previous ones. It's an important step toward computers that can create and understand different types of media, like pictures, videos, and language.

Keywords

» Artificial intelligence  » Autoregressive  » Image generation  » Llama  » Multi-modal  » Quantization