Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

by Mariya Hendriksen, Shuo Zhang, Ridho Reinanda, Mohamed Yahya, Edgar Meij, Maarten de Rijke

First submitted to arXiv on: 21 Jul 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract of paper · PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract; it is available via the “Abstract of paper” link above.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates the robustness of image-text retrieval (ITR) evaluation pipelines by examining concept granularity in two common benchmarks, MS-COCO and Flickr30k. The authors create fine-grained versions of these datasets, MS-COCO-FG and Flickr30k-FG, and evaluate state-of-the-art Vision-Language models on all four in a zero-shot setting, both with and without query perturbations. The results show that while perturbations degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. The study concludes that the brittleness lies within the benchmarks themselves and closes with an agenda for improving ITR evaluation pipelines. A sketch of such an evaluation loop appears after the summaries below.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper looks at how well computers can find images when given text descriptions. The authors test two popular benchmark datasets for this task (MS-COCO and Flickr30k) and create new, more detailed versions of them to see if they work better. They then change the text descriptions in different ways to see how the computers handle it. The results show that even with these changes, computers can still find images well when given more specific descriptions. This means that the way we evaluate whether computers are good at this task might need to be improved.
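To make the evaluation setup concrete, here is a minimal, hypothetical sketch of zero-shot text-to-image retrieval with query perturbation, in the spirit of the pipeline the summaries describe. The CLIP checkpoint, the character-swap perturbation, and the image paths are illustrative assumptions, not the authors’ exact models, perturbation types, or data.

```python
# Sketch: zero-shot text-to-image retrieval, with and without query
# perturbation. Assumes a generic CLIP checkpoint, not the paper's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def recall_at_1(captions, images):
    """Fraction of captions whose top-ranked image is their paired image."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (num_captions, num_images)
    top1 = out.logits_per_text.argmax(dim=-1)
    gold = torch.arange(len(captions))
    return (top1 == gold).float().mean().item()

def perturb(caption):
    """Toy query perturbation: swap two characters in the first long word.
    (An assumption; the paper studies several perturbation types.)"""
    words = caption.split()
    for i, w in enumerate(words):
        if len(w) > 3:
            chars = list(w)
            chars[1], chars[2] = chars[2], chars[1]
            words[i] = "".join(chars)
            break
    return " ".join(words)

# images[i] is the ground-truth match for captions[i]; paths are placeholders.
images = [Image.open(p) for p in ("cat.jpg", "dog.jpg")]
captions = ["a cat sitting on a sofa", "a dog running on grass"]

print(f"Recall@1, clean queries:     {recall_at_1(captions, images):.2f}")
print(f"Recall@1, perturbed queries: "
      f"{recall_at_1([perturb(c) for c in captions], images):.2f}")
```

Comparing the clean and perturbed Recall@1 scores on a standard benchmark versus its fine-grained counterpart is the kind of robustness measurement the paper performs at scale.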

Keywords

  • Artificial intelligence
  • Zero shot