Summary of A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition, by Ruoqi Yin et al.
A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition
by Ruoqi Yin, Jianqin Yin
First submitted to arXiv on: 31 Dec 2023
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The authors propose THCT-Net, a Two-stream Hybrid CNN-Transformer Network for human interaction recognition that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to model the entity, time, and space relationships between interactive entities. The CNN-based stream learns local features using 3D convolutions and multi-head self-attention, while the Transformer-based stream integrates skeleton sequences to learn inter-token correlations. A dual-branch paradigm fuses motion features from the raw skeleton coordinates and their temporal differences (a rough sketch of this design appears below the table). Experimental results on diverse datasets show that THCT-Net outperforms state-of-the-art methods in comprehending and inferring the meaning and context of various actions. |
Low | GrooveSquid.com (original content) | This paper proposes a new way to recognize human interactions, like people talking or playing together. The proposed approach combines two different types of neural networks: Convolutional Neural Networks (CNNs) and Transformers. The CNN part is good at recognizing small details, while the Transformer part is better at understanding how things relate to each other over time and space. The authors combine these strengths in their new architecture, called THCT-Net. They also add a special way of combining information from different parts of the body and from different moments in time to make it even more accurate. When tested on lots of different datasets, THCT-Net did better than existing methods at understanding what people are doing and why. |
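To make the two-stream idea from the medium summary more concrete, here is a minimal, purely illustrative PyTorch sketch. It is not the authors' implementation: every class name, layer size, the fusion rule, and the number of classes are placeholder assumptions based only on the summary's description (a CNN stream using 3D convolutions, a Transformer stream over skeleton tokens, and a dual-branch input of raw joint coordinates plus their temporal differences). For brevity, the CNN stream here omits the multi-head self-attention the summary mentions.

```python
# Minimal sketch (not the authors' code) of a two-stream skeleton model:
# a CNN stream for local spatio-temporal features and a Transformer stream
# for global inter-token correlations, each fed with both raw joint
# coordinates and their temporal differences (motion branch).
import torch
import torch.nn as nn


class CNNStream(nn.Module):
    """Local feature extractor using 3D convolutions over (time, joints, persons)."""
    def __init__(self, in_channels=3, hidden=64, num_classes=26):  # 26 is a placeholder class count
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, hidden, kernel_size=3, padding=1),
            nn.BatchNorm3d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),
            nn.BatchNorm3d(hidden),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (N, C, T, V, M) = batch, coords, frames, joints, persons
        return self.fc(self.conv(x).flatten(1))


class TransformerStream(nn.Module):
    """Global dependency modeling: each frame's full pose is one token."""
    def __init__(self, in_channels=3, num_joints=25, num_persons=2,
                 d_model=128, num_classes=26):
        super().__init__()
        self.embed = nn.Linear(in_channels * num_joints * num_persons, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):  # x: (N, C, T, V, M)
        n, c, t, v, m = x.shape
        tokens = x.permute(0, 2, 1, 3, 4).reshape(n, t, c * v * m)
        h = self.encoder(self.embed(tokens)).mean(dim=1)  # average-pool over time
        return self.fc(h)


class TwoStreamHybrid(nn.Module):
    """Fuse CNN and Transformer predictions for joint and motion (temporal-difference) inputs."""
    def __init__(self, num_classes=26):
        super().__init__()
        self.cnn = CNNStream(num_classes=num_classes)
        self.trans = TransformerStream(num_classes=num_classes)

    def forward(self, joints):  # joints: (N, C, T, V, M)
        motion = joints[:, :, 1:] - joints[:, :, :-1]            # frame-to-frame differences
        motion = torch.cat([motion, motion[:, :, -1:]], dim=2)   # pad back to T frames
        logits = (self.cnn(joints) + self.cnn(motion)
                  + self.trans(joints) + self.trans(motion)) / 4
        return logits


if __name__ == "__main__":
    x = torch.randn(2, 3, 64, 25, 2)    # batch, xyz, frames, joints, persons
    print(TwoStreamHybrid()(x).shape)   # torch.Size([2, 26])
```

In this sketch the four sets of logits (CNN/Transformer times joint/motion) are simply averaged; the paper's actual fusion strategy, layer counts, and hyperparameters may differ.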
Keywords
» Artificial intelligence » CNN » Self-attention » Token » Transformer