MULTI: Multimodal Understanding Leaderboard with Text and Images

by Zichen Zhu, Yang Xu, Lu Chen, Jingkai Yang, Yichuan Ma, Yiming Sun, Hailin Wen, Jiaqi Liu, Jinyu Cai, Yingzi Ma, Situo Zhang, Zihan Zhao, Liangtai Sun, Kai Yu

First submitted to arxiv on: 5 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors): the paper's original abstract.
Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper benchmarks multimodal large language models (MLLMs) against human experts. The authors introduce MULTI, a Chinese dataset of over 18,000 questions that evaluates models against real-world examination standards, covering image-text comprehension, complex reasoning, and knowledge recall. They also propose two subsets: MULTI-Elite, 500 carefully selected hard questions, and MULTI-Extend, more than 4,500 external knowledge context pieces for testing in-context learning capabilities. The evaluation shows significant room for MLLM advancement: the best performer, Qwen2-VL-72B, achieves 76.9% accuracy on MULTI and 53.1% on MULTI-Elite, leading the 25 evaluated models but still trailing the human expert baselines of 86.1% and 73.1%. The authors suggest that MULTI serves not only as a robust evaluation platform but also as a pathway toward expert-level AI.
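Scores like the 76.9% above are plain accuracies over a question set: the fraction of questions where the model's chosen answer matches the reference answer. A minimal sketch of that computation (the field names "answer" and "prediction" are hypothetical, not from the paper):

```python
def accuracy(questions):
    """Fraction of questions whose predicted answer matches the reference."""
    correct = sum(1 for q in questions if q["prediction"] == q["answer"])
    return correct / len(questions)

# Toy example: 3 of 4 multiple-choice answers correct.
sample = [
    {"answer": "A", "prediction": "A"},
    {"answer": "B", "prediction": "C"},
    {"answer": "D", "prediction": "D"},
    {"answer": "B", "prediction": "B"},
]
print(f"{accuracy(sample):.1%}")  # prints 75.0%
```

Real benchmark harnesses add answer-extraction logic (parsing the model's free-form output into a choice letter), but the final metric reduces to this ratio.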
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper looks at how well multimodal language models perform compared to humans. The researchers created a big dataset called MULTI, full of questions that test how well these models can understand and work with images and text together. They also made two smaller sets: one with harder questions (MULTI-Elite) and one with extra background information to help the models learn (MULTI-Extend). The results show these models still have a lot of room to improve: the best one, Qwen2-VL-72B, did well but still fell short of human experts. This research helps us build better models and, eventually, AI that can match human experts.

Keywords

» Artificial intelligence  » Recall