
Summary of Audio-visual Generalized Zero-shot Learning the Easy Way, by Shentong Mo et al.


Audio-visual Generalized Zero-shot Learning the Easy Way

by Shentong Mo, Pedro Morgado

First submitted to arxiv on: 18 Jul 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
The proposed framework for Easy Audio-Visual Generalized Zero-shot Learning (EZ-AVGZL) addresses the limitations of prior approaches by introducing a simple yet effective method that aligns audio-visual embeddings with transformed text representations. This framework utilizes a single supervised text-audio-visual contrastive loss to learn an alignment between modalities, moving away from reconstructing cross-modal features and text embeddings. The authors’ key insight is that class name embeddings are well-aligned with language-based audio-visual features but don’t provide sufficient class separation for zero-shot learning. To address this, EZ-AVGZL leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. Experimental results on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks demonstrate that EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.
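As a concrete illustration of the single supervised text-audio-visual contrastive loss described above, one common formulation is a cross-entropy over cosine similarities between fused audio-visual embeddings and one text embedding per class. This is a minimal sketch, not the authors' code; the function and variable names are hypothetical:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def supervised_av_text_contrastive_loss(av_emb, class_text_emb, labels, temperature=0.07):
    """Cross-entropy over cosine similarities between fused audio-visual
    embeddings (batch, dim) and one text embedding per class (classes, dim).
    Each sample is pulled toward its own class's text embedding and pushed
    away from all other classes' embeddings -- a stand-in for a single
    supervised contrastive objective."""
    av = l2_normalize(av_emb)
    txt = l2_normalize(class_text_emb)
    logits = av @ txt.T / temperature            # (batch, classes)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

When each audio-visual embedding matches its class's text embedding, the loss is near zero; with mismatched labels it grows, and minimizing it drives the cross-modal alignment.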
Low Difficulty Summary (written by GrooveSquid.com; original content)
The paper introduces a new approach to recognizing sounds in videos even if the videos are from classes that were not seen before. This is called “audio-visual generalized zero-shot learning.” The authors say that previous methods didn’t work well because they tried to reconstruct both the audio and visual parts of the video, which isn’t the best way to learn about the relationships between sounds and images. Instead, the new method aligns the audio and visual parts with text descriptions. This helps the model understand how to recognize sounds in videos even if it has never seen that type of sound before.
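The medium-difficulty summary also mentions transforming class embeddings into a more discriminative space while preserving their semantic structure. A toy sketch of that idea is plain gradient descent on a two-term objective: one term pushes class embeddings apart, and an anchor penalty keeps them near the original language embeddings. This is an illustrative stand-in for the paper's differential optimization, with hypothetical names and a simple quadratic anchor assumed here:

```python
import numpy as np

def separate_class_embeddings(emb0, steps=200, lr=0.05, anchor_weight=2.0):
    """Gradient descent on
        J(E) = -(1/n^2) * sum_{j,k} ||e_j - e_k||^2 + anchor_weight * ||E - E0||^2.
    The first term spreads the n class embeddings apart (better class
    separation); the second keeps them close to the original language
    embeddings E0 (preserving semantic structure)."""
    emb = emb0.astype(float).copy()
    n = len(emb)
    for _ in range(steps):
        centered = emb - emb.mean(axis=0, keepdims=True)
        grad_sep = -(4.0 / n) * centered             # exact gradient of the first term
        grad_anchor = 2.0 * anchor_weight * (emb - emb0)
        emb -= lr * (grad_sep + grad_anchor)
    return emb
```

After optimization, pairwise distances between class embeddings increase while each embedding stays within a bounded distance of its starting point, controlled by `anchor_weight`.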

Keywords

* Artificial intelligence  * Alignment  * Contrastive loss  * Optimization  * Supervised  * Zero-shot