
Summary of ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos, by Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, and Raphael C.-W. Phan


ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

by Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan

First submitted to arXiv on: 9 Apr 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com; original content)
The proposed semi-supervised action recognition approach uses cross-architecture pseudo-labeling together with contrastive learning to robustly learn action representations in videos, drawing on both labeled and unlabeled data. The novel cross-architecture framework, ActNetFormer, integrates 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViT) to capture different aspects of action representations. This comprehensive representation learning lets the model leverage the strengths of each architecture, and experiments on standard action recognition datasets demonstrate state-of-the-art performance with only a fraction of the labeled data.
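The core pseudo-labeling step — keeping only the unlabeled clips on which one architecture is confident, and using its hard predictions to supervise the other — can be sketched in a framework-free way. Note this is a minimal illustration, not the paper's actual implementation: the toy logits, the 0.8 confidence threshold, the three-class setup, and the `select_pseudo_labels` helper are all hypothetical stand-ins.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over class logits.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_pseudo_labels(logits, threshold=0.8):
    """Keep unlabeled samples whose max class probability exceeds
    `threshold`; return their indices and hard pseudo-labels."""
    probs = softmax(logits)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident)[0], probs[confident].argmax(axis=1)

# Toy logits from one branch (say, the 3D CNN) on four unlabeled
# clips with three action classes.
logits = np.array([
    [4.0, 0.1, 0.1],   # confident -> pseudo-label 0
    [0.5, 0.6, 0.4],   # ambiguous -> discarded
    [0.1, 0.2, 5.0],   # confident -> pseudo-label 2
    [1.0, 1.1, 0.9],   # ambiguous -> discarded
])
idx, labels = select_pseudo_labels(logits, threshold=0.8)
print(idx, labels)  # the other branch (say, the ViT) would train on these
```

In a full pipeline each retained `(clip, pseudo-label)` pair would feed the cross-entropy loss of the opposite architecture, alongside a contrastive loss on both branches' representations.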
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper is about using computers to recognize human actions in videos. That matters for applications like surveillance, self-driving cars, and sports analytics. Today, teaching computers this task requires lots of labeled data (videos that humans have already tagged with the correct action), and producing all those labels is slow and expensive. This paper proposes a new way to do action recognition that combines existing computer vision techniques with a special pairing of two different architectures (3D CNNs and ViT). The new approach works well even with only a small amount of labeled data.

Keywords

* Artificial intelligence  * Representation learning  * Semi-supervised  * ViT