ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos
by Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan
First submitted to arXiv on: 9 Apr 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The proposed semi-supervised action recognition approach uses cross-architecture pseudo-labeling together with contrastive learning to robustly learn action representations in videos, drawing on both labeled and unlabeled data. The novel cross-architecture framework, ActNetFormer, integrates 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViT) to capture complementary aspects of action representations. By leveraging the strengths of each architecture, this comprehensive representation learning yields better performance on semi-supervised action recognition tasks. Experiments on standard action recognition datasets demonstrate state-of-the-art performance using only a fraction of the labeled data.
Low | GrooveSquid.com (original content) | This paper is about using computers to recognize human actions in videos. That matters for applications like surveillance, self-driving cars, and sports analytics. Right now, teaching computers to do this requires lots of labeled data (videos that have already been annotated), and labeling all that data is slow and expensive. This paper proposes a new way to do action recognition that combines existing computer vision techniques with two different architectures (3D CNNs and ViT). The new approach works well even with only a small amount of labeled data.
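The cross-architecture idea in the medium summary can be sketched in a few lines: each model makes predictions on unlabeled clips, and only its high-confidence predictions are kept as pseudo-labels to supervise the *other* architecture. The sketch below is a minimal, hypothetical illustration of that confidence-thresholded exchange; the function names and threshold are assumptions, and the paper's actual pipeline additionally uses contrastive learning and real 3D CNN / video transformer backbones.

```python
# Hypothetical sketch of cross-architecture pseudo-labeling:
# confident predictions from one model become training targets for the other.

def select_pseudo_labels(probs, threshold=0.8):
    """Keep predictions whose top class probability meets `threshold`.

    probs: per-clip class-probability lists (softmax outputs).
    Returns a list of (clip_index, pseudo_label) pairs.
    """
    selected = []
    for i, p in enumerate(probs):
        top = max(p)
        if top >= threshold:
            selected.append((i, p.index(top)))
    return selected


def cross_pseudo_label(probs_cnn, probs_vit, threshold=0.8):
    """Each architecture supervises the other: confident 3D-CNN predictions
    become targets for the transformer, and vice versa."""
    targets_for_vit = select_pseudo_labels(probs_cnn, threshold)
    targets_for_cnn = select_pseudo_labels(probs_vit, threshold)
    return targets_for_cnn, targets_for_vit


# Toy softmax outputs for 3 unlabeled clips over 3 action classes.
probs_cnn = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3], [0.1, 0.85, 0.05]]
probs_vit = [[0.7, 0.2, 0.1], [0.95, 0.03, 0.02], [0.2, 0.2, 0.6]]

for_cnn, for_vit = cross_pseudo_label(probs_cnn, probs_vit)
print(for_vit)  # [(0, 0), (2, 1)] — CNN is confident on clips 0 and 2
print(for_cnn)  # [(1, 0)] — ViT is confident on clip 1
```

Disagreements between the two architectures are filtered out naturally: a clip only contributes a pseudo-label when at least one model is confident, so each model learns from examples the other finds easy.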
Keywords
- Artificial intelligence
- Representation learning
- Semi-supervised
- ViT