

Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark

by Rong-Cheng Tu, Zi-Ao Ma, Tian Lan, Yuehao Zhao, Heyan Huang, Xian-Ling Mao

First submitted to arXiv on: 23 Nov 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract on the paper’s arXiv page.

Medium Difficulty Summary (GrooveSquid.com original content)

This paper addresses the pressing need for automatic quality evaluation of images produced by text-to-image generation models. Current state-of-the-art methods rely on powerful commercial Multi-modal Large Language Models (MLLMs) such as GPT-4o; these are highly effective, but their substantial cost makes large-scale evaluation impractical. To overcome this limitation, the authors propose a task-decomposed evaluation framework based on GPT-4o that automatically constructs a new training dataset, and then design training strategies to distill GPT-4o’s evaluation capability into an open-source MLLM, MiniCPM-V-2.6. They also build a meta-evaluation benchmark, with chain-of-thought explanations and quality scores for generated images, to reliably assess both prior work and their proposed model. Experimental results show that the distilled open-source MLLM outperforms VIEScore, the current state-of-the-art GPT-4o-based baseline, by more than 4.6% in Spearman and Kendall correlation with human judgments.
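
The headline result is stated in terms of Spearman and Kendall rank correlations between model-assigned scores and human judgments. As a minimal sketch of how such a meta-evaluation is typically computed (the score lists below are illustrative placeholders, not data from the paper):

```python
from scipy.stats import spearmanr, kendalltau

# Illustrative placeholder scores; the paper's benchmark supplies real human ratings.
human_scores = [4.0, 2.5, 5.0, 1.0, 3.5]  # human quality judgments, one per image
model_scores = [3.8, 2.0, 4.6, 1.5, 3.0]  # evaluator-model scores for the same images

# Rank correlations measure how closely the model's ranking matches the human one.
rho, _ = spearmanr(human_scores, model_scores)
tau, _ = kendalltau(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```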
Low Difficulty Summary (GrooveSquid.com original content)

This paper is about finding a better way to judge the quality of images generated from text. Right now, that judging is done with powerful commercial models like GPT-4o, which work well but are too expensive for large-scale use. To solve this, the authors break the evaluation down into smaller sub-tasks, making it easier for a smaller open-source model to learn the skill from GPT-4o. They also create a special benchmark to test their method, and show that it outperforms existing approaches.
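
To make the decomposition idea concrete, here is a hypothetical sketch of task-decomposed scoring. The sub-questions and the `ask_mllm` helper are illustrative assumptions, not the paper’s actual prompts or scoring rules:

```python
# Hypothetical sub-questions; the paper derives its actual decomposition with GPT-4o.
SUB_TASKS = [
    "Are all objects mentioned in the prompt present in the image?",
    "Does each object have the attributes (color, size, count) the prompt describes?",
    "Are the spatial relations between objects consistent with the prompt?",
]

def evaluate_image(image, prompt, ask_mllm):
    """Score an image by posing each sub-question to an MLLM and averaging.

    ask_mllm(image, question) is an assumed helper returning a score in [0, 1].
    """
    sub_scores = [ask_mllm(image, f"{q} Prompt: {prompt}") for q in SUB_TASKS]
    return sum(sub_scores) / len(sub_scores)  # overall quality score in [0, 1]
```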

Keywords

» Artificial intelligence  » GPT  » Image generation  » Multi-modal