Summary of A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition, by Ruoqi Yin et al.


A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition

by Ruoqi Yin, Jianqin Yin

First submitted to arXiv on: 31 Dec 2023

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The authors propose a novel architecture for human interaction recognition, which they call THCT-Net. This Two-stream Hybrid CNN-Transformer Network combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to model entity, time, and space relationships between interactive entities. The CNN-based stream learns local features using 3D convolutions and multi-head self-attention, while the Transformer-based stream integrates skeleton sequences to learn inter-token correlations. A dual-branch paradigm is used to fuse motion features from raw skeleton coordinates and their temporal differences (a rough code sketch of this two-stream design appears after the summaries below). Experimental results on diverse datasets demonstrate that THCT-Net outperforms state-of-the-art methods in comprehending and inferring the meaning and context of various actions.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes a new way to recognize human interactions, like people talking or playing together. The proposed approach combines two different types of neural networks: Convolutional Neural Networks (CNNs) and Transformers. The CNN part is good at recognizing small details, while the Transformer part is better at understanding how things relate to each other over time and space. The authors combine these strengths in their new architecture, called THCT-Net. They also add a special way of combining information from multiple parts of the body and from different times to make it even more accurate. When tested on lots of different videos, THCT-Net did better than existing methods at understanding what people are doing and why.
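
To make the two-stream idea more concrete, here is a minimal PyTorch-style sketch of a hybrid CNN-Transformer with a dual-branch input (raw joint coordinates plus their temporal differences). All class names, layer sizes, the number of classes, and the fusion rule are illustrative assumptions for this sketch, not the authors' exact THCT-Net configuration.

```python
# Minimal sketch of a two-stream hybrid CNN-Transformer for skeleton-based
# interaction recognition. Shapes, layer sizes, and the fusion rule are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class CNNStream(nn.Module):
    """Local feature extractor: 3D convolutions over (time, joints, persons)."""
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(feat_dim), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),           # global pooling over T, V, M
        )

    def forward(self, x):                      # x: (N, C, T, V, M)
        return self.conv(x).flatten(1)         # (N, feat_dim)


class TransformerStream(nn.Module):
    """Global feature extractor: frames as tokens, self-attention across time."""
    def __init__(self, in_channels=3, num_joints=25, num_persons=2, feat_dim=128):
        super().__init__()
        token_dim = in_channels * num_joints * num_persons
        self.embed = nn.Linear(token_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                      # x: (N, C, T, V, M)
        n, c, t, v, m = x.shape
        tokens = x.permute(0, 2, 1, 3, 4).reshape(n, t, c * v * m)
        h = self.encoder(self.embed(tokens))   # (N, T, feat_dim)
        return h.mean(dim=1)                   # temporal average pooling


class TwoStreamHybrid(nn.Module):
    """Dual-branch input (joint coordinates + temporal differences), two streams each."""
    def __init__(self, num_classes=26):        # 26 classes is an arbitrary placeholder
        super().__init__()
        self.cnn = CNNStream()
        self.trans = TransformerStream()
        self.head = nn.Linear(2 * 128, num_classes)

    def forward(self, joints):                 # joints: (N, 3, T, V, M)
        motion = torch.diff(joints, dim=2)     # temporal differences ("motion" branch)
        motion = torch.cat([motion, motion[:, :, -1:]], dim=2)  # pad back to T frames
        logits = []
        for branch in (joints, motion):        # dual-branch paradigm
            fused = torch.cat([self.cnn(branch), self.trans(branch)], dim=1)
            logits.append(self.head(fused))
        return sum(logits) / 2                 # average the two branches' predictions


if __name__ == "__main__":
    model = TwoStreamHybrid()
    dummy = torch.randn(2, 3, 32, 25, 2)       # batch of 2, 32 frames, 25 joints, 2 persons
    print(model(dummy).shape)                  # torch.Size([2, 26])
```

In this sketch the two input branches share the same CNN and Transformer streams and their class scores are simply averaged; the paper's actual attention design and fusion strategy may differ.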

Keywords

» Artificial intelligence  » CNN  » Self-attention  » Token  » Transformer