


SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

by Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal

First submitted to arXiv on: 11 Mar 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Recent text-to-image (T2I) generation models have shown impressive capabilities in creating images from text descriptions. However, these models often struggle to generate accurate images that match the details of the input text, such as incorrect spatial relationships or missing objects. To address this issue, we introduce SELMA: a novel paradigm for improving the faithfulness of T2I models by fine-tuning them on automatically generated, multi-skill image-text datasets with skill-specific expert learning and merging. Our approach first generates multiple datasets of text prompts that teach different skills using an LLM’s in-context learning capability, then generates images with a T2I model based on the prompts. Next, SELMA adapts the T2I model to new skills by learning single-skill LoRA experts followed by expert merging. We empirically demonstrate that SELMA improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG) and human preference metrics (PickScore, ImageReward, and HPS). Our results also show that fine-tuning with image-text pairs auto-collected via SELMA performs comparably to fine-tuning with ground truth data.
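The expert-merging step described above can be sketched in a few lines: each skill-specific LoRA expert contributes a low-rank update (B @ A) to a frozen base weight, and the experts are combined into a single model. The sketch below uses simple weighted averaging of the updates as an illustrative merging strategy; the function names, the averaging scheme, and the toy shapes are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def lora_delta(A, B, scale=1.0):
    """Low-rank LoRA update: delta_W = scale * (B @ A),
    with A of shape (r, d_in) and B of shape (d_out, r)."""
    return scale * (B @ A)

def merge_lora_experts(base_W, experts, weights=None):
    """Merge skill-specific LoRA experts into one weight matrix by
    (weighted) averaging their low-rank updates.
    Note: a simple illustrative merging scheme, not necessarily
    the exact strategy used in the paper."""
    if weights is None:
        weights = [1.0 / len(experts)] * len(experts)
    merged = base_W.copy()
    for (A, B), w in zip(experts, weights):
        merged += w * lora_delta(A, B)
    return merged

# Toy example: a 4x4 base weight and two rank-1 "skill experts".
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 4))
experts = [
    (rng.standard_normal((1, 4)), rng.standard_normal((4, 1)))  # (A, B)
    for _ in range(2)
]
merged = merge_lora_experts(base, experts)
```

Because each expert only stores small low-rank factors, merging is cheap and the merged update stays a sum of low-rank terms applied to the shared frozen base model.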
Low Difficulty Summary (written by GrooveSquid.com, original content)
Imagine you want a computer to create an image from a text description. Current models often don’t get it right, missing important details or getting spatial relationships wrong. To solve this problem, researchers developed SELMA, a new way of training these models using multiple skills and expert learning. SELMA generates many different text prompts that each teach a different skill, then uses those prompts to generate images. The model is fine-tuned on this data to improve its accuracy. As a result, SELMA significantly improves the quality of the generated images, making them more accurate and detailed. This breakthrough could have many practical applications, such as generating images for movies or creating realistic product designs.

Keywords

* Artificial intelligence  * Alignment  * Fine-tuning  * LoRA