Summary of MaskVD: Region Masking for Efficient Video Object Detection, by Sreetama Sarkar et al.


MaskVD: Region Masking for Efficient Video Object Detection

by Sreetama Sarkar, Gourav Datta, Souvik Kundu, Kai Zheng, Chirayata Bhattacharyya, Peter A. Beerel

First submitted to arXiv on: 16 Jul 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper presents a novel strategy for reducing the computation of video tasks by leveraging semantic information in images and temporal correlation between frames, achieving significant FLOPs and latency reductions with little loss in performance. The proposed approach, region masking, reuses features extracted from previous frames to skip processing of up to 80% of input regions, improving FLOPs and latency by 3.14x and 1.5x respectively while maintaining detection performance comparable to baseline models. The paper demonstrates promising results on Vision Transformers (ViTs) and convolutional neural networks (CNNs), providing latency improvements of up to 1.3x using specialized computational kernels.
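To make the region-masking idea concrete, below is a minimal sketch, not the authors' code: it scores each image patch by its frame-to-frame change, keeps only the most-changed fraction, and reuses cached tokens for the rest. The patch-difference criterion, the 16-pixel patch size, the 20% keep ratio, and the stand-in for the backbone output are all illustrative assumptions.

```python
import torch

def select_active_patches(prev_frame, curr_frame, patch=16, keep_ratio=0.2):
    """Score each non-overlapping patch by its mean absolute change between
    frames and keep only the top fraction; the rest can reuse cached features.
    (Illustrative criterion only; the paper's masking strategy may differ.)
    """
    diff = (curr_frame - prev_frame).abs()                    # (C, H, W)
    p = diff.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    scores = p.mean(dim=(0, 3, 4)).flatten()                  # one score per patch
    k = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(k).indices                             # patches to recompute

# Toy usage: refresh only the changed patches, keep cached tokens elsewhere.
C, H, W, D = 3, 224, 224, 768
prev_frame, curr_frame = torch.rand(C, H, W), torch.rand(C, H, W)
cached_tokens = torch.zeros((H // 16) * (W // 16), D)   # features from the previous frame
active = select_active_patches(prev_frame, curr_frame)
new_tokens = torch.rand(active.numel(), D)              # stand-in for running the backbone on active patches
tokens = cached_tokens.clone()
tokens[active] = new_tokens                             # only ~20% of tokens are refreshed
```

Because only the selected tokens pass through the backbone while the remainder are read from the cache, the per-frame compute scales roughly with the fraction of regions that actually changed.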
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper makes it possible to run state-of-the-art video object detection in real-time applications by reducing the amount of computation needed. This is done by identifying the parts of the image that don’t change between frames and skipping them. The approach works well with Vision Transformers (ViTs) and other types of neural networks, providing a big speedup while keeping roughly the same level of accuracy.

Keywords

* Artificial intelligence