
Summary of TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains, by Yoonsik Kim et al.


TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

by Yoonsik Kim, Moonbin Yim, Ka Yeon Song

First submitted to arXiv on: 30 Apr 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content written by GrooveSquid.com)
This paper introduces TableVQA-Bench, a benchmark for table visual question answering (TableVQA). The authors address the lack of paired images and QA pairs in existing table structure recognition and table question-answering datasets. They obtain table images by applying a stylesheet to text-formatted tables and generate QA pairs with a large language model (LLM) that takes the text-formatted table as input. TableVQA-Bench comprises 1,500 QA pairs and is used to evaluate various multi-modal LLMs (MLLMs), among which GPT-4V achieves the highest accuracy. The study also investigates the effect of vision queries on TableVQA performance and finds that processing visual inputs is more challenging for MLLMs than processing text inputs, even though visual inputs generally incur a higher computational cost. The proposed benchmark and evaluation code are available online.
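The construction pipeline described above can be pictured with a short sketch. This is a hypothetical illustration, not the authors' released code: it assumes the table is available as HTML, uses the Playwright library to rasterize the styled table into an image, and builds an illustrative QA-generation prompt for an LLM; the table contents and prompt wording are placeholders.

```python
# Hypothetical sketch of the stylesheet-based pipeline: render a
# text-formatted (HTML) table to an image, then build an LLM prompt
# that asks for QA pairs about the same table.
# Assumes `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

TABLE_HTML = """
<style>
  /* the stylesheet controls the visual appearance of the rendered table */
  table { border-collapse: collapse; font-family: sans-serif; }
  th, td { border: 1px solid #333; padding: 4px 8px; }
  th { background: #eee; }
</style>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Seoul</td><td>9,500,000</td></tr>
  <tr><td>Busan</td><td>3,400,000</td></tr>
</table>
"""

def render_table_image(html: str, path: str = "table.png") -> None:
    """Rasterize the styled HTML table so it can serve as a visual input."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html)
        page.locator("table").screenshot(path=path)  # crop to the table element
        browser.close()

def qa_generation_prompt(html: str) -> str:
    """Illustrative prompt: the LLM sees the text-formatted table, not the image."""
    return (
        "Given the following HTML table, write a question that can be answered "
        "from the table, and give the answer.\n\n" + html
    )

if __name__ == "__main__":
    render_table_image(TABLE_HTML)
    print(qa_generation_prompt(TABLE_HTML))
```

At evaluation time, an MLLM would receive the rendered image together with the generated question, while a text-only LLM could be given the HTML directly, which is the comparison between visual and text inputs discussed in the summary.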
Low Difficulty Summary (original content written by GrooveSquid.com)
This paper creates a new way to test how well computers can answer questions about tables by combining images and text versions of the tables with questions about them. The authors build a dataset of these combinations, called TableVQA-Bench, and use it to compare different computer programs that try to answer table-based questions. One program, GPT-4V, does best in this comparison. The study also shows that computers find it harder to understand tables shown as images than tables written out as text. All the data and code from this research are available online.

Keywords

» Artificial intelligence  » GPT  » Large language model  » Multi-modal  » Question answering