Summary of LIME: Less Is More for MLLM Evaluation, by King Zhu et al.
LIME: Less Is More for MLLM Evaluation
by King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J.H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang
First submitted to arXiv on: 10 Sep 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Multimodal Large Language Models (MLLMs) are assessed on a variety of tasks, including image captioning, visual question answering, and reasoning. However, existing benchmarks often contain simple or uninformative samples, making it difficult to distinguish one MLLM's performance from another's. Moreover, evaluating models across many benchmarks incurs significant computational cost. To address these issues, the authors propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated through a semi-automated pipeline that filters out uninformative samples, eliminates answer leakage, and keeps only tasks that require image-based understanding (a rough sketch of this filtering idea follows the table). Experiments show that LIME reduces the number of samples by 76% and evaluation time by 77% while distinguishing MLLM capabilities more effectively. Notably, traditional automatic metrics such as CIDEr are inadequate for assessing MLLMs' captioning performance; excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at this GitHub URL. |
| Low | GrooveSquid.com (original content) | Researchers are testing special kinds of AI models that can understand both words and images. However, the tests they use are often too easy or don't give a clear picture of how good the models are, which makes it hard to tell which model is better than another. The researchers created a new way to test these models called LIME (Less Is More). It removes unimportant test examples and focuses on tasks that really show what the models can do. The results show that this new method saves time and gives a clearer picture of how well the models are doing. It also shows that some ways of measuring model performance aren't good enough, especially for image captioning. |
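The paper's semi-automated pipeline is not reproduced here, but the two filtering ideas named in the abstract can be sketched in a few lines. The snippet below is a hypothetical illustration under assumed interfaces, not the authors' implementation: `Sample`, `is_correct`, `filter_benchmark`, the `models` list of callables, and the thresholds are all assumptions made for this example.

```python
# Hypothetical sketch of a LIME-style filtering pass (not the authors' code).
# Assumes `models` is a list of callables that take (question, image_or_None)
# and return an answer string; names, thresholds, and data layout are
# illustrative only.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Sample:
    question: str
    image: Optional[bytes]
    answer: str

def is_correct(model: Callable, sample: Sample, use_image: bool) -> bool:
    """Ask one model to answer the sample, with or without the image."""
    prediction = model(sample.question, sample.image if use_image else None)
    return prediction.strip().lower() == sample.answer.strip().lower()

def filter_benchmark(samples: List[Sample],
                     models: List[Callable],
                     easy_threshold: float = 0.9,
                     leakage_threshold: float = 0.5) -> List[Sample]:
    """Keep only samples that are informative and require the image.

    - "Uninformative": almost every model answers correctly with the image,
      so the sample does not separate strong models from weak ones.
    - "Answer leakage": many models answer correctly *without* the image,
      so the sample does not actually test visual understanding.
    """
    kept = []
    for sample in samples:
        with_image = sum(is_correct(m, sample, True) for m in models) / len(models)
        text_only = sum(is_correct(m, sample, False) for m in models) / len(models)
        if with_image >= easy_threshold:    # too easy: uninformative sample
            continue
        if text_only >= leakage_threshold:  # answerable without the image
            continue
        kept.append(sample)
    return kept
```

In the paper's setting, the samples that survive this kind of screening are what make LIME roughly 76% smaller than the source benchmarks while better separating model capabilities; the exact criteria and thresholds used by the authors may differ from this sketch.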
Keywords
» Artificial intelligence » Image captioning » Question answering