


Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

by Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen

First submitted to arXiv on: 15 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper introduces Motion-Grounded Video Reasoning, a new task that requires generating visual answers (spatiotemporal object masks) to input questions and therefore demands implicit spatiotemporal reasoning and grounding. It extends existing work on explicit action/motion grounding by asking models to reason implicitly through questions. To support the task, the authors collect the GROUNDMORE dataset, comprising 1,715 video clips, 249K object masks, and four question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep motion-reasoning abilities. Requiring visual answers yields responses that are more concrete and interpretable than plain text, and it evaluates models on both spatiotemporal grounding and reasoning. The authors also introduce a novel baseline model, MORA, which combines multimodal reasoning from a Multimodal LLM, pixel-level perception from SAM, and temporal perception from a lightweight localization head (a conceptual sketch of this composition follows the summaries below). MORA outperforms the best existing visual grounding baseline by an average of 21.5% in relative terms.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper introduces a new way to understand videos called Motion-Grounded Video Reasoning. It’s like answering questions about what is happening in a video, where you need to figure out which actions are taking place and when. To make this possible, the authors collected a large dataset of 1,715 videos and 249,000 object masks, with four types of questions that require different kinds of reasoning. The goal is to get computers to understand videos better by making them point out the answer in the video itself, as object masks, instead of replying only in plain text. A new computer model called MORA was created to do this task well, and it is a big step forward in video understanding.

Keywords

» Artificial intelligence  » Grounding  » SAM  » Spatiotemporal