Summary of Audio-visual Generalized Zero-shot Learning the Easy Way, by Shentong Mo et al.
Audio-visual Generalized Zero-shot Learning the Easy Way
by Shentong Mo, Pedro Morgado
First submitted to arXiv on: 18 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The proposed framework for Easy Audio-Visual Generalized Zero-shot Learning (EZ-AVGZL) addresses the limitations of prior approaches with a simple yet effective method that aligns audio-visual embeddings with transformed text representations. Rather than reconstructing cross-modal features and text embeddings, it learns an alignment between modalities with a single supervised text-audio-visual contrastive loss. The authors’ key insight is that class name embeddings are well aligned with language-based audio-visual features but do not provide sufficient class separation for zero-shot learning. To address this, EZ-AVGZL leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of the language representations. Experiments on the VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks show that EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning. |
Low | GrooveSquid.com (original content) | The paper introduces a new approach for recognizing sounds in videos even when those videos come from classes never seen during training, a task called “audio-visual generalized zero-shot learning.” The authors argue that previous methods fell short because they tried to reconstruct the audio and visual parts of the video, which is not the best way to learn the relationships between sounds and images. Instead, the new method aligns the audio and visual parts directly with text descriptions. This helps the model recognize sounds in videos even when it has never encountered that type of sound before. |
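The summaries above describe aligning audio-visual embeddings with class text embeddings via a supervised contrastive loss. The sketch below is a minimal, generic illustration of that idea in NumPy, not the authors' implementation: the function name, temperature value, and toy dimensions are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project rows onto the unit sphere so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def supervised_contrastive_loss(av_emb, class_emb, labels, temperature=0.07):
    """Illustrative supervised contrastive objective (an assumption, not the
    paper's exact loss): cross-entropy over cosine similarities between each
    fused audio-visual embedding (N, D) and every class text embedding (C, D)."""
    av = l2_normalize(np.asarray(av_emb, dtype=float))
    txt = l2_normalize(np.asarray(class_emb, dtype=float))
    logits = av @ txt.T / temperature            # (N, C) similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy usage with random embeddings (purely illustrative dimensions).
rng = np.random.default_rng(0)
num_classes, dim, batch = 5, 16, 8
class_emb = rng.normal(size=(num_classes, dim))
labels = rng.integers(0, num_classes, size=batch)
aligned_av = class_emb[labels]              # embeddings that match their class
random_av = rng.normal(size=(batch, dim))   # unaligned embeddings
loss_aligned = supervised_contrastive_loss(aligned_av, class_emb, labels)
loss_random = supervised_contrastive_loss(random_av, class_emb, labels)
```

In this toy setup the loss is lower when each audio-visual embedding matches its class text embedding than when embeddings are random, which is the kind of class separation the paper's transformed class embeddings are meant to strengthen.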
Keywords
- Artificial intelligence
- Alignment
- Contrastive loss
- Optimization
- Supervised
- Zero shot