Summary of TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models, by Ziyao Shangguan et al.
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
First submitted to arXiv on: 30 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper’s arXiv page. |
Medium | GrooveSquid.com (original content) | Multimodal Foundation Models (MFMs) have achieved remarkable performance at leveraging temporal context for video understanding, but it is unclear how well they truly perform visual temporal reasoning. Our study reveals that this capability is likely overestimated, as many benchmark questions can be solved from a single frame, a few frames, or frames presented out of order. We propose three principles and corresponding metrics to systematically examine current visual temporal reasoning tasks: Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity (see the illustrative sketch below the table). We introduce TOMATO, a novel benchmark crafted to rigorously assess MFMs’ temporal reasoning capabilities in video understanding. The benchmark comprises 1,484 carefully curated, human-annotated questions spanning six tasks applied to 1,417 videos, including self-recorded and self-generated videos covering human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a 57.3% human-model performance gap with the best-performing model. We believe TOMATO will serve as a crucial testbed for evaluating next-generation MFMs. |
Low | GrooveSquid.com (original content) | Visual temporal reasoning is important for video understanding, but existing benchmarks may not measure it accurately. A new benchmark called TOMATO helps fix this by providing a fair way to compare how well models understand videos. The benchmark has 1,484 questions covering six tasks, such as counting actions or recognizing shapes, applied to 1,417 videos that show real-world and simulated scenarios. When the best model was tested on TOMATO, it still performed much worse than humans. This shows that there is still a lot of work to be done to create models that can understand human-world dynamics through video. |
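The medium-difficulty summary names three diagnostic metrics (Multi-Frame Gain, Frame Order Sensitivity, Frame Information Disparity) without spelling out how they are computed. The sketch below is one possible interpretation rather than the paper’s exact definitions: it assumes each metric is derived by comparing a model’s accuracy under different frame conditions (one frame vs. all frames, ordered vs. shuffled frames, and the spread of per-frame accuracy). All names here, such as `QuestionResult`, are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's exact formulas.
# Assumes each question has been evaluated under several frame conditions.
from dataclasses import dataclass
from statistics import mean


@dataclass
class QuestionResult:
    """Correctness of one model answer under different frame conditions (hypothetical)."""
    correct_all_frames: bool        # full video, frames in original order
    correct_single_frame: bool      # best single frame only
    correct_shuffled: bool          # all frames, order randomized
    per_frame_correct: list[bool]   # correctness when given each frame alone


def multi_frame_gain(results: list[QuestionResult]) -> float:
    """Accuracy improvement from seeing all frames vs. a single frame.
    A small gain suggests the questions do not truly require multi-frame reasoning."""
    acc_all = mean(r.correct_all_frames for r in results)
    acc_one = mean(r.correct_single_frame for r in results)
    return acc_all - acc_one


def frame_order_sensitivity(results: list[QuestionResult]) -> float:
    """Accuracy drop when frames are shuffled.
    Near-zero sensitivity suggests temporal order is not actually being used."""
    acc_ordered = mean(r.correct_all_frames for r in results)
    acc_shuffled = mean(r.correct_shuffled for r in results)
    return acc_ordered - acc_shuffled


def frame_information_disparity(results: list[QuestionResult]) -> float:
    """Average spread of single-frame correctness within a question: if one frame
    alone answers the question far better than the others, the needed information
    is concentrated in a single frame rather than distributed over time."""
    spreads = []
    for r in results:
        accs = [float(c) for c in r.per_frame_correct]
        if accs:
            spreads.append(max(accs) - min(accs))
    return mean(spreads) if spreads else 0.0
```

Under this reading, a task that genuinely tests visual temporal reasoning should show a large multi-frame gain and a large order sensitivity; near-zero values correspond to the summary’s observation that many existing questions can be solved from a single frame, a few frames, or out-of-order frames.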