
Evaluating Vision-Language Models on Bistable Images

by Artemis Panagopoulou, Coby Melkin, Chris Callison-Burch

First submitted to arXiv on: 29 May 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This study examines the performance of 12 vision-language models in classifying and generating interpretations of bistable images, which can be perceived in two distinct ways. The researchers manually collected a dataset of 29 bistable images with their labels and applied 116 manipulations of brightness, tint, and rotation. The findings show that most models favor one interpretation over the other, with minimal variance under image manipulations, aside from some exceptions under rotation. A comparison with human preferences reveals that models do not exhibit the continuity biases humans show and often diverge from humans' initial interpretations. The study also investigates how variations in prompts and labels affect model interpretations, finding that language priors have a greater influence than image-text training data. This research contributes to our understanding of how vision-language models behave on ambiguous images.

Low Difficulty Summary (original content by GrooveSquid.com)
Bistable images are special pictures that can be seen in two different ways. Researchers wanted to know how computer programs called vision-language models handle these kinds of images. They collected a dataset of 29 images and made lots of changes to the brightness, color, and rotation. Then they tested 12 different computer models on this dataset. The results showed that most models liked one way of seeing an image more than the other, but a few were okay with either interpretation. When compared to human preferences, the models didn't behave like humans do, and they sometimes switched away from their first interpretation. The researchers also found that changes in what they asked a model to do, or in how they labeled the images, had a big impact on which way the model saw the image.

Keywords

» Artificial intelligence