Vision-Language Models under Cultural and Inclusive Considerations

by Antonia Karamolegkou, Phillip Rust, Yong Cao, Ruixiang Cui, Anders Søgaard, Daniel Hershcovich

First submitted to arXiv on: 8 Jul 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary
Written by the paper authors. Read the original abstract here.

Medium Difficulty Summary
Written by GrooveSquid.com (original content).
This paper proposes a culture-centric evaluation benchmark to assess the reliability of large vision-language models (VLMs) in describing images for visually impaired individuals from diverse cultural backgrounds. To develop the benchmark, the authors conducted a survey to determine caption preferences and filtered the existing VizWiz dataset, which contains images taken by people who are blind. The results show promising performance for state-of-the-art models, but they also highlight challenges such as hallucination and the misalignment of automatic evaluation metrics with human judgment.

Low Difficulty Summary
Written by GrooveSquid.com (original content).
This paper helps visually impaired people by testing large computer models that describe pictures. These models could be helpful tools that let people who can’t see understand what’s in their everyday photos. The problem is that the current tests used to evaluate these models don’t include diverse cultural backgrounds or the situations where someone would actually use this technology. To fix this, the authors asked people with visual impairments what kinds of captions they prefer and created a new test dataset using images taken by blind individuals. They then tested several top computer models to see whether they can be trusted in real-life situations. Some of these models did well, but the authors also found problems that need to be fixed before this technology can be widely used.

Keywords

  • Artificial intelligence
  • Hallucination