SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
by Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
First submitted to arXiv on: 11 Mar 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper's original abstract, available on arXiv. |
| Medium | GrooveSquid.com (original content) | Recent text-to-image (T2I) generation models have shown impressive capabilities in creating images from text descriptions. However, these models often fail to match the details of the input text, producing incorrect spatial relationships or missing objects. To address this, the authors introduce SELMA, a paradigm that improves the faithfulness of T2I models by fine-tuning them on automatically generated, multi-skill image-text datasets with skill-specific expert learning and merging. SELMA first uses an LLM's in-context learning capability to generate multiple sets of text prompts, each teaching a different skill, and then generates the corresponding images with a T2I model. It then adapts the T2I model to the new skills by learning single-skill LoRA experts and merging them (see the sketch after this table). Empirically, SELMA improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG) and on human-preference metrics (PickScore, ImageReward, and HPS). Fine-tuning on image-text pairs auto-collected via SELMA also performs comparably to fine-tuning on ground-truth data. |
| Low | GrooveSquid.com (original content) | Imagine you want a computer to create an image from a text description. Current models often don't get it right: they miss important details or get spatial relationships wrong. To solve this, the researchers developed SELMA, a new way of training these models with multiple skills and expert learning. SELMA generates many text prompts that each teach a different skill, then uses those prompts to generate images, and the model is fine-tuned on this data to improve its accuracy. As a result, SELMA makes the generated images noticeably more accurate and detailed, which could have practical applications such as generating images for movies or creating realistic product designs. |
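
As a concrete illustration of the expert-merging step described in the medium summary, here is a minimal PyTorch sketch. It assumes each single-skill expert is stored as a dict of LoRA factor pairs (A, B) keyed by layer name; the layer name `unet.attn1`, the dict layout, and the uniform-average merge are illustrative assumptions, not the authors' released code.

```python
import torch

def lora_delta(expert: dict) -> dict:
    """Compute the full-rank weight delta B @ A for each LoRA factor pair."""
    return {name: pair["B"] @ pair["A"] for name, pair in expert.items()}

def merge_experts(experts: list[dict]) -> dict:
    """Uniformly average the weight deltas of all single-skill experts.

    Each expert maps layer names to {"A": rank x in_dim, "B": out_dim x rank}
    tensors; the merged result maps layer names to out_dim x in_dim deltas
    that can be added onto the base T2I model's weights.
    """
    deltas = [lora_delta(e) for e in experts]
    return {
        name: torch.stack([d[name] for d in deltas]).mean(dim=0)
        for name in deltas[0]
    }

# Toy usage: two "skill experts" over one 8x8 layer with rank-2 LoRA factors.
experts = [
    {"unet.attn1": {"A": torch.randn(2, 8), "B": torch.randn(8, 2)}}
    for _ in range(2)
]
merged = merge_experts(experts)
print(merged["unet.attn1"].shape)  # torch.Size([8, 8])
```

Averaging weight deltas keeps each skill's adaptation independent during training and makes combining them cheap; the paper's actual merging scheme may weight or compose the experts differently.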
Keywords
* Artificial intelligence
* Alignment
* Fine-tuning
* LoRA