Summary of Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective, by Mariya Hendriksen et al.
Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective
by Mariya Hendriksen, Shuo Zhang, Ridho Reinanda, Mohamed Yahya, Edgar Meij, Maarten de Rijke
First submitted to arXiv on: 21 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper investigates the robustness of image-text retrieval (ITR) evaluation pipelines, focusing on concept granularity in two common benchmarks: MS-COCO and Flickr30k. The authors build fine-grained variants of these datasets, MS-COCO-FG and Flickr30k-FG, and evaluate state-of-the-art Vision-Language models on both versions in a zero-shot setting, with and without query perturbations. While perturbations degrade model performance across the board, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. The study concludes that the observed brittleness stems from the benchmarks themselves and proposes an agenda for improving ITR evaluation pipelines (see the sketch after this table).
Low | GrooveSquid.com (original content) | This paper looks at how well computers can find images when given text descriptions. The authors test two popular benchmark datasets for this task (MS-COCO and Flickr30k) and create new, more detailed versions to see if they work better. Then they change the text descriptions in different ways to see how well the computer copes. The results show that even with these changes, computers can still find images well when given more specific descriptions. This suggests that the way we evaluate whether computers are good at this task needs improvement.
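
To make the evaluation setup concrete, here is a minimal sketch of zero-shot image-text retrieval with a query perturbation. It assumes CLIP via Hugging Face transformers (`openai/clip-vit-base-patch32`) as the Vision-Language model, a toy character-swap perturbation, and Recall@1 as the metric; the helper names `recall_at_1` and `perturb` are illustrative, and the paper's actual models, perturbations, and datasets may differ.

```python
# Minimal sketch: zero-shot image-text retrieval evaluation with a query
# perturbation. Assumptions (not the paper's exact protocol): CLIP as the
# Vision-Language model, a character-swap "typo" perturbation, Recall@1.
import torch
from PIL import Image  # used in the commented usage example below
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def recall_at_1(captions, images):
    """Fraction of captions whose top-ranked image is the paired one.

    Assumes captions[i] describes images[i].
    """
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text: (num_captions, num_images) similarity matrix
    top_image = outputs.logits_per_text.argmax(dim=-1)
    targets = torch.arange(len(captions))
    return (top_image == targets).float().mean().item()

def perturb(caption):
    """Toy query perturbation: swap two adjacent characters (a typo)."""
    if len(caption) < 2:
        return caption
    i = len(caption) // 2
    return caption[:i - 1] + caption[i] + caption[i - 1] + caption[i + 1:]

# Hypothetical usage with paired image paths and captions:
# images = [Image.open(p) for p in image_paths]
# clean = recall_at_1(captions, images)
# perturbed = recall_at_1([perturb(c) for c in captions], images)
# print(f"Recall@1 drop under perturbation: {clean - perturbed:.3f}")
```

Comparing the Recall@1 drop on a standard benchmark against its fine-grained variant is the kind of contrast the paper's findings rest on; a smaller drop on the fine-grained data would point at the benchmark, not the model.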
Keywords
- Artificial intelligence
- Zero-shot