Summary of MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data, by William Berman et al.
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
by William Berman, Alexander Peysakhovich
First submitted to arXiv on: 26 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | A multimodal model, MUMU, is trained to generate images from prompts that combine text and images. The model pairs a vision-language encoder with a diffusion decoder and is trained on a single GPU node. Although it is trained only on cropped images from the same dataset, MUMU learns to combine inputs from different images into coherent outputs. For example, it can turn a realistic person into a cartoon character or place a standing subject on a scooter. The model also generalizes to tasks such as style transfer and character consistency, demonstrating the potential of multimodal models as general-purpose controllers for image generation. |
| Low | GrooveSquid.com (original content) | Imagine a special kind of computer program that can create new images based on words and pictures. You could tell it to turn a normal person into a cartoon, or make someone ride a scooter. This program is called MUMU, and it is very good at doing this. It learned by looking at lots of images with text captions. Even though it only practiced with pieces cropped out of those images, it can still combine ideas from different pictures into new ones. This is important because it could help us make even more flexible and realistic computer-generated images in the future. |
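The summaries above describe MUMU's core idea: a vision-language encoder feeds a diffusion decoder, conditioned on a prompt that interleaves text and reference images. As a rough illustration only (the function names, embedding width, and patch count below are invented for this sketch and are not from the paper), the interleaving step might look like:

```python
# Hypothetical sketch (not the authors' code) of how a MUMU-style
# multimodal prompt -- interleaved text and reference images -- could be
# flattened into one conditioning sequence for a diffusion decoder's
# cross-attention. Encoder internals are stubbed out with zero vectors.

EMBED_DIM = 8          # assumed toy embedding width
PATCHES_PER_IMAGE = 4  # assumed number of vision-encoder tokens per image

def embed_text(text):
    """Stand-in for the text side of the vision-language encoder:
    one vector per whitespace-separated token."""
    return [[0.0] * EMBED_DIM for _ in text.split()]

def embed_image(_image_path):
    """Stand-in for the vision side: a fixed number of patch vectors."""
    return [[0.0] * EMBED_DIM for _ in range(PATCHES_PER_IMAGE)]

def build_conditioning(prompt):
    """Embed each prompt segment in order and concatenate, so the
    decoder sees text and image tokens interleaved as authored."""
    sequence = []
    for kind, content in prompt:
        if kind == "text":
            sequence.extend(embed_text(content))
        elif kind == "image":
            sequence.extend(embed_image(content))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence

# A prompt like: "a <picture of a person> riding a scooter"
prompt = [
    ("text", "a"),
    ("image", "person.png"),  # placeholder reference image
    ("text", "riding a scooter"),
]
cond = build_conditioning(prompt)
print(len(cond))  # 1 text token + 4 patch tokens + 3 text tokens = 8
```

In the real model the stubs would be replaced by pretrained encoders, and the concatenated sequence would condition the diffusion decoder; this sketch only shows why inputs from different images can end up side by side in one prompt.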
Keywords
» Artificial intelligence » Decoder » Diffusion » Encoder » Image generation » Style transfer