Summary of Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning, by Yang Chen et al.
Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning
by Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, Hong Cheng
First submitted to arXiv on: 31 May 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | High Difficulty Summary Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Skeleton-based action representation learning aims to interpret human behaviors by encoding skeleton sequences. Existing approaches fall into two primary training paradigms: supervised learning and self-supervised learning. However, the former requires labor-intensive annotation of predefined action categories, while the latter relies on skeleton transformations that may impair the skeleton structure. To address these challenges, this paper introduces a novel skeleton-based training framework (C^2VL) based on cross-modal contrastive learning. The method uses progressive distillation to learn task-agnostic human skeleton action representations from vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs). Specifically, the approach establishes a vision-language action concept space from these knowledge prompts, enriching the fine-grained details that the skeleton action space lacks. Moreover, it proposes intra-modal self-similarity and inter-modal cross-consistency softened targets that progressively control and guide how strongly vision-language knowledge prompts and their corresponding skeletons are pulled together during cross-modal representation learning. The soft instance discrimination and self-knowledge distillation strategies help the model learn better skeleton-based action representations from noisy skeleton-vision-language pairs (an illustrative code sketch follows the table). During inference, the method requires only skeleton data as input for action recognition and no longer needs the vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that this approach outperforms previous methods and achieves state-of-the-art results. |
| Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper is about learning to understand human behaviors by looking at skeleton sequences. There are two usual ways to do this: one requires a lot of work to label what each action means, while the other changes the skeletons in ways that can make them harder to understand. To solve these problems, the researchers created a new way to learn about human actions that combines vision and language. Their approach uses a kind of "distillation" process to learn about actions from knowledge prompts generated by large pre-trained models. The method adds fine-grained details to the skeleton sequences and makes them easier to interpret. When tested on several datasets, the approach outperformed previous methods and achieved state-of-the-art results. |
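
To make the "softened targets" idea in the medium-difficulty summary more concrete, here is a minimal PyTorch sketch of soft instance discrimination for cross-modal contrastive learning, where the intra-modal self-similarity of the vision-language features softens the usual one-hot targets. This illustrates the general technique rather than the paper's actual implementation; the function name, temperatures, and feature shapes are assumptions.

```python
# Minimal sketch (not the authors' released code): soft instance discrimination
# for cross-modal contrastive learning, loosely following the C^2VL description
# above. Shapes, temperatures, and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def soft_cross_modal_loss(skel_feat: torch.Tensor, vl_feat: torch.Tensor,
                          tau: float = 0.1, tau_soft: float = 0.05) -> torch.Tensor:
    """skel_feat: (N, D) skeleton embeddings; vl_feat: (N, D) vision-language
    prompt embeddings for the same N instances (a noisy skeleton-VL paired batch)."""
    skel = F.normalize(skel_feat, dim=-1)
    vl = F.normalize(vl_feat, dim=-1)

    # Inter-modal similarities: each skeleton against every VL prompt in the batch.
    logits = skel @ vl.t() / tau                                   # (N, N)

    # Softened targets instead of one-hot instance discrimination: the
    # intra-modal self-similarity of the VL prompts decides how strongly
    # non-matching pairs should still be pulled together.
    with torch.no_grad():
        soft_targets = F.softmax(vl @ vl.t() / tau_soft, dim=-1)   # (N, N)

    # Soft cross-entropy in both directions (skeleton->VL and VL->skeleton),
    # sharing the same softened targets for simplicity.
    loss_s2v = (-soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_v2s = (-soft_targets * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_s2v + loss_v2s)

# Example usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    skel_feat = torch.randn(32, 256)   # from a skeleton encoder (hypothetical)
    vl_feat = torch.randn(32, 256)     # from LMM-generated prompts (hypothetical)
    print(soft_cross_modal_loss(skel_feat, vl_feat).item())
```

With a very low softening temperature the targets approach one-hot instance discrimination, while a larger value lets semantically similar prompts share probability mass, which is the progressive "degree of pulling pairs closer" that the summary describes.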
Keywords
* Artificial intelligence * Distillation * Inference * Knowledge distillation * Representation learning * Self supervised * Supervised