Summary of MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?, by Zhaorun Chen et al.
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
by Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
First submitted to arXiv on: 5 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | A novel multimodal benchmark called MJ-Bench is introduced to evaluate the capabilities and limitations of various judges in providing feedback for image generation models. The benchmark incorporates a comprehensive preference dataset that assesses judges across four key perspectives: alignment, safety, image quality, and bias. A range of multimodal judges, including CLIP-based scoring models, open-source VLMs (e.g., the LLaVA family), and closed-source VLMs (e.g., GPT-4o, Claude 3), are evaluated on each decomposed subcategory of the preference dataset (a sketch of CLIP-based judging follows this table). The results show that closed-source VLMs generally provide better feedback, with GPT-4o outperforming the other judges on average. Open-source VLMs and smaller scoring models excel in specific areas, and human evaluations confirm the effectiveness of MJ-Bench. |
Low | GrooveSquid.com (original content) | Image generation models like DALLE-3 and Stable Diffusion can create unsafe or low-quality output if they’re not aligned with desired behaviors. To fix this, we need better multimodal judges that give feedback on these models. A new benchmark called MJ-Bench helps us evaluate these judges. We tested many different kinds of judges and found that some are better than others at giving good feedback. The best ones are closed-source VLMs like GPT-4o. These judges can help us make sure our image generation models don’t create unsafe or biased output. |
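For readers curious how a CLIP-based scoring model can act as a judge, here is a minimal illustrative sketch, not code from the paper: the checkpoint name, prompt, and image file names are placeholder assumptions. It scores each candidate image against the prompt by embedding similarity and prefers the higher-scoring one, which is the kind of pairwise preference feedback MJ-Bench evaluates.

```python
# Minimal sketch of a CLIP-based multimodal judge (illustrative only, not the paper's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_judge_score(prompt: str, image: Image.Image) -> float:
    # logits_per_image is the scaled cosine similarity between the image and text embeddings.
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image.item()

# Pairwise preference judging: the judge "prefers" whichever candidate image
# scores higher for the prompt. (Prompt and file names below are made up.)
prompt = "a photo of a red bicycle leaning against a brick wall"
score_a = clip_judge_score(prompt, Image.open("candidate_a.png"))
score_b = clip_judge_score(prompt, Image.open("candidate_b.png"))
print("Judge prefers image A" if score_a > score_b else "Judge prefers image B")
```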
Keywords
» Artificial intelligence » Alignment » Claude » Diffusion » GPT » Image generation