
Summary of Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning, by Yang Chen et al.


Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

by Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, Hong Cheng

First submitted to arxiv on: 31 May 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Skeleton-based action representation learning aims to interpret human behaviors by encoding skeleton sequences. Existing approaches fall into two primary training paradigms: supervised learning and self-supervised learning. However, the former requires labor-intensive annotation of predefined action categories, while the latter relies on skeleton transformations that may impair the skeleton structure. To address these challenges, this paper introduces a novel skeleton-based training framework (C^2VL) based on Cross-modal Contrastive learning. The method uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts generated by pre-trained large multimodal models (LMMs). Specifically, it establishes a vision-language action concept space through these knowledge prompts, supplying the fine-grained details that the skeleton action space lacks. It further introduces intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process, which progressively control how closely the vision-language knowledge prompts and their corresponding skeletons are pulled together. These soft instance discrimination and self-knowledge distillation strategies help the model learn better skeleton-based action representations from noisy skeleton-vision-language pairs. During inference, the method requires only skeleton data as input for action recognition; vision-language prompts are no longer needed. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that the approach outperforms previous methods and achieves state-of-the-art results.
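To make the softened-target idea above concrete, here is a minimal, hypothetical sketch (not the authors' code) of soft instance discrimination between skeleton features and vision-language prompt features, written in plain PyTorch. The function name, the temperature values, and the choice to derive softened targets from the prompts' intra-modal self-similarity alone are illustrative assumptions.

```python
# Hypothetical sketch: cross-modal contrastive loss with softened targets.
# Skeleton embeddings are pulled toward their paired vision-language prompt
# embeddings, but the usual one-hot targets are replaced by soft targets
# built from the prompts' intra-modal self-similarity, so near-duplicate
# prompts are not treated as pure negatives. Not the authors' implementation.
import torch
import torch.nn.functional as F

def soft_cross_modal_loss(skel_emb, vl_emb, tau=0.1, tau_soft=0.05):
    """Soft instance discrimination between skeleton and prompt features."""
    skel = F.normalize(skel_emb, dim=-1)   # (N, D) skeleton encoder outputs
    vl = F.normalize(vl_emb, dim=-1)       # (N, D) vision-language prompt features

    # Inter-modal similarity: skeleton i vs. prompt j.
    logits = skel @ vl.t() / tau           # (N, N)

    # Intra-modal self-similarity of the prompts defines softened targets.
    targets = F.softmax(vl @ vl.t() / tau_soft, dim=-1)  # (N, N)

    # Cross-entropy against the softened targets instead of one-hot labels.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

if __name__ == "__main__":
    N, D = 8, 128
    skel_emb = torch.randn(N, D)  # e.g. features from a skeleton encoder
    vl_emb = torch.randn(N, D)    # e.g. prompt features from a pre-trained LMM
    print(soft_cross_modal_loss(skel_emb, vl_emb).item())
```

At inference time only the skeleton branch would be used, consistent with the summary above: the prompt features enter only the training loss.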
Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about learning to understand human behaviors by looking at skeleton sequences. There are two common ways to do this: one requires a lot of work to label what each action means, while the other changes the skeletons in ways that might make them harder to understand. To solve these problems, the researchers created a new way to learn about human actions using a combination of vision and language. Their approach uses a kind of "distillation" process to learn about actions from knowledge prompts generated by large pre-trained models. The method adds more detail to the skeleton sequences, making them easier to understand. When tested on various datasets, it outperformed previous methods and achieved state-of-the-art results.

Keywords

* Artificial intelligence  * Distillation  * Inference  * Knowledge distillation  * Representation learning  * Self supervised  * Supervised