Summary of What Do You Want? User-centric Prompt Generation For Text-to-image Synthesis Via Multi-turn Guidance, by Yilun Liu et al.
What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance
by Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, Boxing Chen, Osamu Yoshie
First submitted to arXiv on: 23 Aug 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary: read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation, producing high-quality visuals from written descriptions. However, these models rely heavily on the quality and specificity of textual prompts, which poses a challenge for novice users unfamiliar with the prompt-writing style TIS models prefer. Existing solutions relieve this burden by automatically generating model-preferred prompts from user queries, but this single-turn approach offers limited user-centricity in terms of result interpretability and user interactivity. To address these issues, the authors propose DialPrompt, a multi-turn, dialogue-based TIS prompt generation model that emphasizes user-centricity. DialPrompt follows a multi-turn guidance workflow: in each round of dialogue, the model asks users about their preferences along possible optimization dimensions before generating the final TIS prompt. Training on a dataset built around these dimensions lets DialPrompt improve interpretability, since users can see the correlation between specific phrases and image attributes. It also gives users greater control and engagement in the prompt generation process, leading to more personalized and visually satisfying outputs. Experiments indicate that DialPrompt achieves competitive synthesized-image quality, outperforming existing prompt engineering approaches by 5.7%. In a user evaluation, it outperforms existing approaches by 46.5% in user-centricity score and is rated 7.9/10 by 19 human reviewers. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary The paper proposes DialPrompt, a model that helps users write better prompts for text-to-image synthesis (TIS). Instead of generating a prompt in one shot, DialPrompt works with users in a dialogue, asking them questions about what they want the image to look like before producing the final prompt. The authors mined 15 important factors for writing good TIS prompts and used this information to train the model. The goal is to make it easier for people who aren't TIS experts to get the results they want: users can see how specific words or phrases affect the image, and they get more control over the prompt generation process. The authors tested DialPrompt and found that it produces better images than existing methods. |
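To make the multi-turn guidance workflow concrete, here is a minimal, hypothetical sketch of the idea: the system asks about one optimization dimension per turn, records the user's preference, and composes the final TIS prompt from the answers. The dimension names (`subject`, `art style`, etc.) and the composition rule are illustrative assumptions, not the paper's actual 15 mined dimensions or DialPrompt's trained behavior.

```python
# Illustrative sketch of a multi-turn guidance loop in the spirit of DialPrompt.
# Dimensions and the prompt-composition rule are assumptions for demonstration.

DIMENSIONS = ["subject", "art style", "lighting", "color palette"]

def guide_dialogue(answer_fn):
    """Run one guidance session; answer_fn simulates the user's reply per turn."""
    preferences = {}
    for dim in DIMENSIONS:
        question = f"What {dim} would you like?"
        reply = answer_fn(question)  # empty reply means the user skips this dimension
        if reply:
            preferences[dim] = reply
    # Compose a model-preferred prompt: subject first, then the other preferences.
    parts = [preferences.get("subject", "an image")]
    parts += [v for k, v in preferences.items() if k != "subject"]
    return ", ".join(parts)

if __name__ == "__main__":
    # Canned answers stand in for a real user; lighting is deliberately skipped.
    canned = {
        "What subject would you like?": "a red fox in snow",
        "What art style would you like?": "watercolor",
        "What lighting would you like?": "",
        "What color palette would you like?": "muted blues",
    }
    print(guide_dialogue(canned.get))  # → a red fox in snow, watercolor, muted blues
```

In a real system the `answer_fn` role would be played by an interactive dialogue with the user, and the composition step by the trained model; the point of the sketch is only the per-dimension, turn-by-turn structure.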
Keywords
» Artificial intelligence » Image synthesis » Optimization » Prompt