
Summary of Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models, by Bei Yan et al.


Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

by Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

First submitted to arXiv on: 24 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
Despite their impressive performance in recent years, Large Vision-Language Models (LVLMs) remain plagued by hallucination: they generate responses that are inconsistent with their visual inputs. To measure the degree of hallucination, previous works have proposed various benchmarks featuring different tasks and evaluation metrics. However, we find that existing benchmarks vary in quality, with some producing inconsistent results under repeated tests and others diverging from human evaluations. To address this, we propose a Hallucination Benchmark Quality Measurement framework (HQM), which assesses the reliability and validity of existing benchmarks using indicators such as test-retest reliability, parallel-forms reliability, criterion validity, and coverage of hallucination types. Guided by HQM, we construct a High-Quality Hallucination Benchmark (HQH) for LVLMs that demonstrates superior reliability and validity. We conduct an extensive evaluation of over 10 representative LVLMs, including GPT-4o and Gemini-1.5-Pro, to analyze the hallucination issues in existing models. Our benchmark is publicly available. (An illustrative sketch of the reliability and validity indicators appears after the summaries below.)
Low Difficulty Summary (original content by GrooveSquid.com)
This paper talks about a problem with computers that can understand images and text (Large Vision-Language Models). These computers often make mistakes by generating responses that don't match what they're seeing. Researchers have tried to measure these mistakes, but some of the measuring methods are not very good: they give different answers each time, or they disagree with how humans would judge the mistakes. To solve this problem, the researchers created a way to score the quality of these mistake-measuring methods (the Hallucination Benchmark Quality Measurement framework). They also built a better benchmark (the High-Quality Hallucination Benchmark) that is more reliable and more accurate. Finally, they tested many computer models with this new benchmark to see how often each one still makes these mistakes.
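
The medium difficulty summary above mentions indicators such as test-retest reliability and criterion validity. As a rough, hedged illustration of what such indicators can look like in practice (not the paper's actual implementation), the sketch below treats test-retest reliability as the correlation between per-model scores from two repeated runs of a benchmark, and criterion validity as the rank correlation between benchmark scores and human judgments. All model names and numbers are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model hallucination scores from two independent runs
# of the same benchmark (all names and values are made up for illustration).
run_1 = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58, "model_d": 0.80}
run_2 = {"model_a": 0.69, "model_b": 0.66, "model_c": 0.55, "model_d": 0.82}

# Hypothetical human ratings of hallucination for the same models,
# used as the external criterion.
human = {"model_a": 0.70, "model_b": 0.60, "model_c": 0.52, "model_d": 0.85}

models = sorted(run_1)
x1 = np.array([run_1[m] for m in models])
x2 = np.array([run_2[m] for m in models])
h = np.array([human[m] for m in models])

# Test-retest reliability: do repeated runs of the benchmark agree?
test_retest, _ = pearsonr(x1, x2)

# Criterion validity: do benchmark scores agree with human judgments?
criterion, _ = spearmanr(x1, h)

print(f"test-retest reliability (Pearson r): {test_retest:.3f}")
print(f"criterion validity (Spearman rho):   {criterion:.3f}")
```

In this framing, values closer to 1 suggest the benchmark behaves consistently across runs and tracks human judgment; the paper's HQM framework formalizes such indicators and adds others, such as parallel-forms reliability and coverage of hallucination types.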

Keywords

» Artificial intelligence  » Gemini  » Gpt  » Hallucination