Summary of Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos, by Changan Chen et al.
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
by Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
First submitted to arXiv on: 13 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | The paper's original abstract (available on arXiv).
Medium | GrooveSquid.com (original content) | The proposed AV-LDM model tackles the challenge of generating realistic audio for human actions, a crucial capability for applications such as film sound effects and virtual reality games. Existing approaches assume total correspondence between video and audio during training, but this assumption fails for many real-world videos, where sounds often occur off-screen. The novel ambient-aware generation model introduces an audio-conditioning mechanism that disentangles foreground action sounds from ambient background sounds in in-the-wild training videos. Given a silent video at test time, the model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally (a minimal sketch of this idea follows the table). Evaluated on two egocentric video datasets, Ego4D and EPIC-KITCHENS, including the curated Ego4D-Sounds clips, AV-LDM outperforms existing methods, enables controllable ambient sound generation, and shows promise for generalizing to computer-graphics game clips.
Low | GrooveSquid.com (original content) | AV-LDM is a new way to create realistic sounds that match what people are doing on camera. Making sounds for movies or video games is tricky because many of the sounds in real recordings come from off-screen, beyond what the camera sees. Existing methods try to predict all of a clip's audio from what's happening on screen, so they often get it wrong. AV-LDM does better because it learns to separate the important action sounds from the background noise in real-world videos.
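For readers who want a concrete picture, here is a minimal, hypothetical sketch of the two ideas the medium summary describes: conditioning an audio latent-diffusion denoiser on an ambient-sound embedding (so the video features are free to drive the foreground action sound), and retrieving that ambient embedding from a nearest-neighbor training clip when the test video is silent. All module names, tensor shapes, and the retrieval scheme are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of ambient-aware conditioning for an audio latent
# diffusion model (AV-LDM-style). Shapes and modules are illustrative
# assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmbientAwareDenoiser(nn.Module):
    """Denoiser conditioned on video features AND an ambient-audio embedding.

    Giving the model an explicit ambient signal lets it explain background
    sound with that input, so the video features can specialize in the
    foreground action sound (the disentanglement idea in the summary).
    """
    def __init__(self, latent_dim=64, video_dim=512, ambient_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + video_dim + ambient_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, video_feat, ambient_feat):
        # Predict the noise added to the audio latent z_t at timestep t.
        h = torch.cat([z_t, video_feat, ambient_feat, t[:, None]], dim=-1)
        return self.net(h)

def retrieve_ambient(query_video_feat, bank_video_feats, bank_ambient_feats):
    """Retrieval-augmented conditioning at test time: given a silent video,
    borrow the ambient embedding of the most similar training clip."""
    sims = F.cosine_similarity(query_video_feat[None], bank_video_feats, dim=-1)
    return bank_ambient_feats[sims.argmax()]

# Toy usage: random tensors stand in for real encoders and a diffusion loop.
denoiser = AmbientAwareDenoiser()
video_feat = torch.randn(512)                       # encoded silent video
bank_v, bank_a = torch.randn(100, 512), torch.randn(100, 128)
ambient_feat = retrieve_ambient(video_feat, bank_v, bank_a)
z_t, t = torch.randn(1, 64), torch.rand(1)          # noisy latent, timestep
eps_pred = denoiser(z_t, t, video_feat[None], ambient_feat[None])
print(eps_pred.shape)  # torch.Size([1, 64])
```

Because the ambient embedding is an explicit input, it can in principle be scaled or swapped at generation time, which is the flavor of controllable ambient sound generation the medium summary mentions.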
Keywords
» Artificial intelligence » Retrieval augmented generation