Summary of MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs, by Ziyu Liu et al.
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
by Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang
First submitted to arXiv on: 17 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper introduces MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction-tuning dataset, designed to evaluate and improve Large Vision-Language Models (LVLMs) in multi-turn, multi-image conversations. Current open-source LVLMs perform well in simplified scenarios but fall short in real-world conversational settings. The authors use clustering algorithms and human annotators, assisted by the GPT-4o model, to construct question-answer pairs from Wikipedia, resulting in a benchmark with up to 18k image+text tokens, 20 images, and 27 turns per dialog (a toy sketch of this construction step follows the table). An analysis of 15 representative LVLMs reveals that open-source models lag behind their closed-source counterparts, largely due to limited conversational instruction-tuning data. Fine-tuning open-source LVLMs on MMDU-45k substantially narrows this gap, producing longer and more accurate conversations and improving scores on both MMDU and existing benchmarks. |
Low | GrooveSquid.com (original content) | This paper is about making computers better at understanding and responding to human language and images. It tackles a big problem: today's models are good at answering simple questions, but they struggle when people give them multiple instructions or show them multiple images. The authors created a new, much harder test called MMDU. They used it to compare 15 different models and found that the open-source models (which are free for anyone to use) didn't do as well as the closed-source ones (which are available only through certain companies). By training the open-source models on a bigger dataset, the authors made them much better at these multi-image conversations. |
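The medium summary describes a pipeline that groups related Wikipedia images and then uses GPT-4o to draft multi-turn question-answer pairs. The snippet below is a minimal sketch of that idea, not the authors' released code: `embed_images` and `draft_dialog` are hypothetical placeholders, and k-means over off-the-shelf image embeddings is just one plausible choice of clustering algorithm (the summary does not commit to a specific one).

```python
# Minimal sketch of the data-construction idea from the medium summary:
# cluster image embeddings so related images land in the same multi-image
# dialog, then hand each group to a strong model to draft multi-turn QA
# pairs. `embed_images` and `draft_dialog` are hypothetical stand-ins.
import numpy as np
from sklearn.cluster import KMeans

def group_related_images(embeddings: np.ndarray, n_groups: int) -> list[list[int]]:
    """Cluster image embeddings; each cluster seeds one multi-image dialog."""
    labels = KMeans(n_clusters=n_groups, n_init="auto").fit_predict(embeddings)
    return [np.flatnonzero(labels == g).tolist() for g in range(n_groups)]

# Usage sketch (embeddings could come from any vision encoder, e.g. CLIP):
#   embeddings = embed_images(wikipedia_image_paths)                    # hypothetical
#   for idxs in group_related_images(embeddings, n_groups=100):
#       dialog = draft_dialog([wikipedia_image_paths[i] for i in idxs]) # e.g. a GPT-4o call
```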
Keywords
» Artificial intelligence » Clustering » Fine-tuning » GPT » Instruction tuning