Summary of TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models, by Ziyao Shangguan et al.
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
First submitted to arXiv on: 30 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper’s arXiv page. |
Medium | GrooveSquid.com (original content) | Multimodal Foundation Models (MFMs) have achieved remarkable performance at leveraging temporal context for video understanding, but it is unclear how well they truly perform visual temporal reasoning. Our study reveals that this capability is likely overestimated, as many benchmark questions can be solved from a single frame, a few frames, or frames presented out of order. We propose three principles and corresponding metrics to systematically examine current visual temporal reasoning tasks: Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity (see the illustrative sketch below the table). We introduce TOMATO, a novel benchmark crafted to rigorously assess MFMs’ temporal reasoning capabilities in video understanding. The benchmark comprises 1,484 carefully curated, human-annotated questions spanning six tasks applied to 1,417 videos, including self-recorded and self-generated videos covering human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a 57.3% human-model performance gap with the best-performing model. We believe TOMATO will serve as a crucial testbed for evaluating next-generation MFMs. |
Low | GrooveSquid.com (original content) | Visual temporal reasoning is important for video understanding, but existing benchmarks may not measure it accurately. A new benchmark called TOMATO helps fix this by providing a fair way to compare how well models understand videos. The benchmark has 1,484 questions covering six tasks, such as counting actions or recognizing shapes, applied to 1,417 videos that show real-world and simulated scenarios. When the best model was tested on TOMATO, it still performed much worse than humans. This shows that there is still a lot of work to be done to create models that can understand human-world dynamics through video. |
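The medium-difficulty summary names three diagnostic metrics (Multi-Frame Gain, Frame Order Sensitivity, Frame Information Disparity) without spelling out how they are computed. The sketch below is one possible interpretation rather than the paper’s exact definitions: it assumes each metric is derived by comparing a model’s accuracy under different frame conditions (one frame vs. all frames, ordered vs. shuffled frames, and the spread of per-frame accuracy). All names here, such as `QuestionResult`, are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's exact formulas.
# Assumes each question has been evaluated under several frame conditions.
from dataclasses import dataclass
from statistics import mean


@dataclass
class QuestionResult:
    """Correctness of one model answer under different frame conditions (hypothetical)."""
    correct_all_frames: bool        # full video, frames in original order
    correct_single_frame: bool      # best single frame only
    correct_shuffled: bool          # all frames, order randomized
    per_frame_correct: list[bool]   # correctness when given each frame alone


def multi_frame_gain(results: list[QuestionResult]) -> float:
    """Accuracy improvement from seeing all frames vs. a single frame.
    A small gain suggests the questions do not truly require multi-frame reasoning."""
    acc_all = mean(r.correct_all_frames for r in results)
    acc_one = mean(r.correct_single_frame for r in results)
    return acc_all - acc_one


def frame_order_sensitivity(results: list[QuestionResult]) -> float:
    """Accuracy drop when frames are shuffled.
    Near-zero sensitivity suggests temporal order is not actually being used."""
    acc_ordered = mean(r.correct_all_frames for r in results)
    acc_shuffled = mean(r.correct_shuffled for r in results)
    return acc_ordered - acc_shuffled


def frame_information_disparity(results: list[QuestionResult]) -> float:
    """Average spread of single-frame correctness within a question: if one frame
    alone answers the question far better than the others, the needed information
    is concentrated in a single frame rather than distributed over time."""
    spreads = []
    for r in results:
        accs = [float(c) for c in r.per_frame_correct]
        if accs:
            spreads.append(max(accs) - min(accs))
    return mean(spreads) if spreads else 0.0
```

Under this reading, a task that genuinely tests visual temporal reasoning should show a large multi-frame gain and a large order sensitivity; near-zero values correspond to the summary’s observation that many existing questions can be solved from a single frame, a few frames, or out-of-order frames.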