On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes

by Rajat Modi, Vibhav Vineet, Yogesh Singh Rawat

First submitted to arXiv on: 25 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper examines the impact of occlusions on video action detection. The study introduces five new benchmark datasets: O-UCF and O-JHMDB with synthetic, controlled static and dynamic occlusions; OVIS-UCF and OVIS-JHMDB with occlusions exhibiting realistic motion; and Real-OUCF for occlusions in real-world scenarios (a minimal sketch of the occluder-overlay idea appears after these summaries). Existing models degrade significantly as occlusion severity increases, and they behave differently depending on whether occluders are static or moving. The study also uncovers several intriguing phenomena in neural networks: transformers naturally outperform CNNs under occlusion, incorporating symbolic components such as capsules allows models to bind to occluders never seen during training, and islands of agreement emerge without instance-level supervision. These properties enable simple yet effective training recipes that yield occlusion-robust models, outperforming existing video action detectors by 32.3% on O-UCF, 32.7% on O-JHMDB, and 2.6% on Real-OUCF in terms of the vMAP metric.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper explores how objects moving in front of or behind other objects affect our ability to recognize actions in videos. To study this, the researchers created five new datasets with different types of occlusions. They found that existing methods get worse as the level of occlusion increases, and that the methods behave differently depending on whether the occluder is still or moving. The team also discovered some interesting things about how neural networks work, like how transformers can handle occlusions better than CNNs in certain situations. This knowledge helped them develop simple recipes for training models that handle occlusions well. The new methods outperform existing ones by a wide margin!
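
To make the synthetic-occlusion idea concrete, here is a minimal sketch of how a static or dynamic occluder could be composited onto video frames. This is an illustrative assumption, not the authors' dataset pipeline: the `overlay_occluder` helper, the gray patch, and the straight-line trajectory are all hypothetical choices made for this example.

```python
import numpy as np

def overlay_occluder(frames, occluder_rgba, start_xy, velocity_xy=(0, 0)):
    """Composite an RGBA occluder patch onto each frame of a video clip.

    frames:        (T, H, W, 3) uint8 video clip.
    occluder_rgba: (h, w, 4) uint8 patch; the alpha channel controls blending.
    start_xy:      (x, y) top-left position of the occluder in frame 0.
    velocity_xy:   per-frame (dx, dy) motion; (0, 0) yields a static occluder,
                   nonzero values yield a dynamic (moving) occluder.
    """
    T, H, W, _ = frames.shape
    h, w = occluder_rgba.shape[:2]
    rgb = occluder_rgba[..., :3].astype(np.float32)
    alpha = occluder_rgba[..., 3:4].astype(np.float32) / 255.0

    out = frames.copy()
    for t in range(T):
        # Move the occluder along a straight line, clipped to stay in frame.
        x = int(np.clip(start_xy[0] + t * velocity_xy[0], 0, W - w))
        y = int(np.clip(start_xy[1] + t * velocity_xy[1], 0, H - h))
        region = out[t, y:y + h, x:x + w].astype(np.float32)
        # Alpha-blend the occluder over the underlying pixels.
        out[t, y:y + h, x:x + w] = (alpha * rgb + (1 - alpha) * region).astype(np.uint8)
    return out

# Illustrative usage: an opaque gray square covering ~25% of a 16-frame clip.
clip = np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8)
square = np.full((112, 112, 4), 128, dtype=np.uint8)
static_clip = overlay_occluder(clip, square, start_xy=(56, 56))
dynamic_clip = overlay_occluder(clip, square, start_xy=(0, 56), velocity_xy=(7, 0))
```

Varying the patch size and opacity gives a simple knob for occlusion severity, which matches the paper's observation that detection performance degrades as severity increases.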

Keywords

  • Artificial intelligence