
Summary of TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models, by Ziyao Shangguan et al.


TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

by Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan

First submitted to arxiv on: 30 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Multimodal Foundation Models (MFMs) have achieved remarkable performance in leveraging temporal context for video understanding, but it is unclear how well they truly perform visual temporal reasoning. Our study reveals that this capability is likely overestimated, since many questions in existing benchmarks can be solved from a single frame, a few frames, or frames presented out of order. We propose three principles, with corresponding metrics, to systematically examine current visual temporal reasoning tasks: Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity. We introduce TOMATO, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. The benchmark comprises 1,484 carefully curated, human-annotated questions spanning six tasks applied to 1,417 videos, including self-recorded and self-generated videos that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% for the best-performing model. We believe TOMATO will serve as a crucial testbed for evaluating next-generation MFMs.
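
To make the three metrics above concrete, here is a minimal Python sketch of one plausible way such quantities could be computed from model accuracies. The function names and definitions are illustrative assumptions for this summary, not the paper's exact formulations, which should be taken from the original work.

# Hedged sketch: plausible versions of the three task-level metrics named above.
# The exact definitions in the TOMATO paper may differ; the inputs
# (acc_multi, acc_single, acc_ordered, acc_shuffled, per_frame_accs)
# are illustrative assumptions.

def multi_frame_gain(acc_multi: float, acc_single: float) -> float:
    """Assumed: how much accuracy improves when the model sees all frames
    instead of only a single frame."""
    return acc_multi - acc_single

def frame_order_sensitivity(acc_ordered: float, acc_shuffled: float) -> float:
    """Assumed: how much accuracy drops when frames are shuffled,
    i.e. how much the task actually depends on temporal order."""
    return acc_ordered - acc_shuffled

def frame_information_disparity(per_frame_accs: list[float]) -> float:
    """Assumed: how unevenly information is spread across frames,
    approximated here by the spread of single-frame accuracies."""
    return max(per_frame_accs) - min(per_frame_accs)

# Example: a task where shuffling frames barely hurts accuracy scores low on
# Frame Order Sensitivity, suggesting it does not truly require temporal reasoning.
print(frame_order_sensitivity(acc_ordered=0.62, acc_shuffled=0.59))  # about 0.03

In this reading, a benchmark question only genuinely tests visual temporal reasoning if it needs many frames, in the right order, with information spread across them, which is the intuition behind the three principles the paper proposes.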
Low Difficulty Summary (original content by GrooveSquid.com)
Visual temporal reasoning is important in video understanding, but existing benchmarks may not measure it accurately. A new benchmark called TOMATO helps fix this by providing a fair way to compare how well models understand what happens over time in videos. The benchmark has 1,484 questions covering six different tasks, such as counting actions or recognizing shapes, applied to 1,417 videos that show human-centric, real-world, and simulated scenarios. When the best model was tested on TOMATO, it still performed far below humans at understanding the videos. This shows that there is still a lot of work to be done to create models that can understand human world dynamics through video.

Keywords

» Artificial intelligence