Summary of JPEG-LM: LLMs as Image Generators with Canonical Codec Representations, by Xiaochuang Han et al.
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
by Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov
First submitted to arXiv on: 15 Aug 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to image and video generation: using autoregressive large language models (LLMs) to directly model the compressed file bytes produced by canonical codecs such as JPEG and AVC/H.264. As a proof of concept, the authors pretrain a Llama-architecture model from scratch, with no vision-specific modifications, to generate images and videos. Evaluation shows that this simple approach is more effective than pixel-based modeling and sophisticated vector-quantization baselines, achieving a 31% reduction in FID for image generation. The analysis also highlights JPEG-LM's advantage over vector-quantization models in generating long-tail visual elements. This work paves the way for future research on multimodal language/image/video LLMs.
Low | GrooveSquid.com (original content) | This paper generates images and videos using the compressed files computers already store. It's like taking a picture or video, but instead of keeping it as a grid of pixels, you keep it as a special code that computers understand. Working with that code makes generating images and videos simpler and more effective than other methods. The new approach is tested and shown to outperform previous ones. It's an important step toward computers that can create and understand different kinds of media, such as pictures, videos, and language.
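
To make the core idea concrete, here is a minimal sketch (not the authors' released code) of how an image can be serialized into its canonical JPEG byte stream and exposed to an autoregressive LM as a plain token sequence. The file name, quality setting, and byte-level vocabulary are illustrative assumptions, not the paper's exact preprocessing.

```python
# Minimal sketch of the JPEG-LM idea: an image's canonical JPEG byte stream
# becomes the token sequence for a standard autoregressive language model.
# NOTE: the file name, JPEG quality, and byte-level vocabulary here are
# illustrative assumptions, not the paper's exact preprocessing.
from io import BytesIO

from PIL import Image  # pip install pillow


def jpeg_byte_tokens(image_path: str, quality: int = 25) -> list[int]:
    """Encode an image as a JPEG and return its raw bytes as tokens in [0, 255]."""
    buffer = BytesIO()
    Image.open(image_path).convert("RGB").save(buffer, format="JPEG", quality=quality)
    return list(buffer.getvalue())  # each byte is one token


# A Llama-style transformer with a 256-symbol vocabulary (plus special tokens)
# could then be pretrained on such sequences with the ordinary next-token
# objective, with no vision-specific architecture changes.
tokens = jpeg_byte_tokens("cat.jpg")  # hypothetical input file
print(len(tokens), tokens[:2])  # JPEG streams begin with the SOI marker: 255, 216 (0xFF 0xD8)
```

Because the codec is fixed and canonical, no learned tokenizer (such as a vector-quantized autoencoder) is needed; compression is handled by the JPEG or AVC/H.264 format itself.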
Keywords
» Artificial intelligence » Autoregressive » Image generation » Llama » Multimodal » Quantization