
Summary of Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification, by Mahrukh Awan et al.


Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

by Mahrukh Awan, Asmar Nadeem, Muhammad Junaid Awan, Armin Mustafa, Syed Sameed Husain

First submitted to arXiv on: 26 Aug 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper proposes Attend-Fusion, a novel audio-visual fusion approach designed to capture the intricate relationships between audio and visual features in video data. Existing methods rely on large model architectures, leading to high computational complexity and resource requirements. In contrast, Attend-Fusion uses a compact architecture of only 72M parameters, achieving performance on the YouTube-8M dataset comparable to much larger baselines such as Fully-Connected Late Fusion (341M parameters), while reducing model size by nearly 80%. By effectively combining audio and visual information for video classification, Attend-Fusion is well suited to deployment in resource-constrained environments.
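The summary above does not spell out the paper's exact architecture, but the core idea of attention-based audio-visual fusion can be illustrated with a minimal sketch. The single-head dot-product formulation, the function names, and the feature shapes below are all our assumptions for illustration, not the authors' actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_fuse(audio, visual):
    """Toy attention-based fusion: the audio vector acts as a query that
    attends over per-frame visual features, and the attended visual summary
    is concatenated with the audio features into one fused embedding.

    audio:  (d,)   clip-level audio feature
    visual: (T, d) per-frame visual features for T frames
    returns: (2d,) fused audio-visual embedding
    """
    d = audio.shape[0]
    scores = visual @ audio / np.sqrt(d)   # (T,) scaled dot-product scores
    weights = softmax(scores)              # (T,) attention over frames
    visual_summary = weights @ visual      # (d,) attention-weighted summary
    return np.concatenate([audio, visual_summary])
```

A classifier head (e.g. a small MLP over the fused embedding) would then produce the video-level labels; the compactness the paper reports comes from doing the fusion with attention rather than with very wide fully-connected late-fusion layers.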
Low Difficulty Summary (written by GrooveSquid.com, original content)
Imagine trying to understand what’s happening in a movie or TV show just from the soundtrack, or just from the picture. That’s basically what this paper is about: finding a way to combine sound and visuals to understand videos better. The problem is that most existing methods need huge computers to work, which isn’t always practical. This new approach, called Attend-Fusion, uses a special kind of architecture that can capture the important details in both the audio and visual parts of a video. It does this really well on a big dataset of YouTube videos, outperforming other similar approaches while using much less computing power.

Keywords

» Artificial intelligence  » Classification