Summary of Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective, by Mariya Hendriksen et al.
Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective
by Mariya Hendriksen, Shuo Zhang, Ridho Reinanda, Mohamed Yahya, Edgar Meij, Maarten de Rijke
First submitted to arXiv on: 21 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper investigates the robustness of image-text retrieval (ITR) evaluation pipelines, focusing on concept granularity in two common benchmarks: MS-COCO and Flickr30k. The authors build fine-grained variants of these datasets, MS-COCO-FG and Flickr30k-FG, and evaluate state-of-the-art Vision-Language models on both versions in a zero-shot setting, with and without query perturbations. While perturbations degrade model performance across the board, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. The study concludes that the observed brittleness stems from the benchmarks themselves and proposes an agenda for improving ITR evaluation pipelines (see the sketch after this table).
Low | GrooveSquid.com (original content) | This paper looks at how well computers can find images when given text descriptions. The authors test two popular benchmark datasets for this task (MS-COCO and Flickr30k) and create new, more detailed versions to see if they work better. Then they change the text descriptions in different ways to see how well the computer copes. The results show that even with these changes, computers can still find images well when given more specific descriptions. This suggests that the way we evaluate whether computers are good at this task needs improvement.
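
To make the evaluation setup concrete, here is a minimal sketch of zero-shot image-text retrieval with a query perturbation. It assumes CLIP via Hugging Face transformers (`openai/clip-vit-base-patch32`) as the Vision-Language model, a toy character-swap perturbation, and Recall@1 as the metric; the helper names `recall_at_1` and `perturb` are illustrative, and the paper's actual models, perturbations, and datasets may differ.

```python
# Minimal sketch: zero-shot image-text retrieval evaluation with a query
# perturbation. Assumptions (not the paper's exact protocol): CLIP as the
# Vision-Language model, a character-swap "typo" perturbation, Recall@1.
import torch
from PIL import Image  # used in the commented usage example below
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def recall_at_1(captions, images):
    """Fraction of captions whose top-ranked image is the paired one.

    Assumes captions[i] describes images[i].
    """
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text: (num_captions, num_images) similarity matrix
    top_image = outputs.logits_per_text.argmax(dim=-1)
    targets = torch.arange(len(captions))
    return (top_image == targets).float().mean().item()

def perturb(caption):
    """Toy query perturbation: swap two adjacent characters (a typo)."""
    if len(caption) < 2:
        return caption
    i = len(caption) // 2
    return caption[:i - 1] + caption[i] + caption[i - 1] + caption[i + 1:]

# Hypothetical usage with paired image paths and captions:
# images = [Image.open(p) for p in image_paths]
# clean = recall_at_1(captions, images)
# perturbed = recall_at_1([perturb(c) for c in captions], images)
# print(f"Recall@1 drop under perturbation: {clean - perturbed:.3f}")
```

Comparing the Recall@1 drop on a standard benchmark against its fine-grained variant is the kind of contrast the paper's findings rest on; a smaller drop on the fine-grained data would point at the benchmark, not the model.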
Keywords
- Artificial intelligence
- Zero-shot