
Summary of What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance, by Yilun Liu et al.


What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

by Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, Boxing Chen, Osamu Yoshie

First submitted to arXiv on: 23 Aug 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None

Abstract of paper | PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract via the links above.

Medium Difficulty Summary (GrooveSquid.com original content)
The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation, producing high-quality visuals from written descriptions. However, these models rely heavily on the quality and specificity of textual prompts, which poses a challenge for novice users unfamiliar with the prompt-writing style TIS models prefer. Existing solutions address this by automatically generating model-preferred prompts from user queries, but this single-turn approach offers limited user-centricity in terms of result interpretability and user interactivity. To address these issues, we propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasizes user-centricity. DialPrompt follows a multi-turn guidance workflow: in each round of dialogue, the model asks users about their preferences on possible optimization dimensions before generating the final TIS prompt. Trained on a dialogue dataset built around these dimensions, DialPrompt improves interpretability by allowing users to understand the correlation between specific phrases and image attributes. It also gives users greater control and engagement in the prompt generation process, leading to more personalized and visually satisfying outputs. Experiments indicate that DialPrompt achieves competitive quality in synthesized images, outperforming existing prompt engineering approaches by 5.7%. Furthermore, in our user evaluation, DialPrompt outperforms existing approaches by 46.5% in user-centricity score and is rated 7.9/10 by 19 human reviewers.
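The multi-turn guidance workflow described above can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the dimension names, the `run_dialogue` function, and the final prompt format are all hypothetical stand-ins for the paper's mined optimization dimensions and trained dialogue model.

```python
# Hypothetical sketch of a multi-turn guidance loop (not DialPrompt's code):
# each round queries the user about one optimization dimension, then the
# collected preferences are folded into the final TIS prompt.

OPTIMIZATION_DIMENSIONS = ["style", "lighting", "composition"]  # the paper mines 15; three shown

def run_dialogue(user_query, answer_fn):
    """Collect per-dimension preferences, then assemble a TIS prompt.

    answer_fn(dimension) -> the user's preference string, or "" to skip.
    """
    preferences = {}
    for dim in OPTIMIZATION_DIMENSIONS:
        answer = answer_fn(dim)          # one dialogue turn per dimension
        if answer:                       # the user may decline to specify
            preferences[dim] = answer
    # Final prompt: the original query plus the phrases the user approved,
    # so each added phrase is traceable to a dimension (interpretability).
    phrases = [f"{dim}: {pref}" for dim, pref in preferences.items()]
    return ", ".join([user_query] + phrases)

# A scripted "user" standing in for interactive input:
scripted = {"style": "watercolor", "lighting": "golden hour", "composition": ""}
prompt = run_dialogue("a cat on a windowsill", scripted.get)
print(prompt)  # a cat on a windowsill, style: watercolor, lighting: golden hour
```

Because the user approves each phrase in its own turn, every fragment of the final prompt maps back to a question they answered, which is the interpretability property the summary highlights.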
Low Difficulty Summary (GrooveSquid.com original content)
The paper proposes DialPrompt, a model that helps users write better prompts for text-to-image synthesis (TIS) so they can create the images they want from written descriptions. DialPrompt works with users in a dialogue, asking them questions about what they want the image to look like before generating the final prompt. The authors mined 15 important factors for writing good TIS prompts and used this information to train their model. The goal of DialPrompt is to make it easier for people who aren't TIS experts to get the results they want. It does this by letting users understand how specific words or phrases affect the image, and by giving them more control over the prompt generation process. The authors tested DialPrompt and found that it produces better images than existing methods.

Keywords

  • Artificial intelligence
  • Image synthesis
  • Optimization
  • Prompt