
Summary of Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation, by Xin Yan et al.


Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

by Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang

First submitted to arXiv on: 2 Dec 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Multimedia (cs.MM)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty | Written by | Summary

High | Paper authors | High Difficulty Summary
Read the original abstract here

Medium | GrooveSquid.com (original content) | Medium Difficulty Summary
The paper introduces Presto, a novel video diffusion model that generates 15-second videos with long-range coherence and rich content. The key challenge addressed is maintaining scenario diversity over long durations. To overcome this, the authors propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. This approach requires no additional parameters and can be seamlessly incorporated into current DiT-based architectures. The paper also presents the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experimental results show that Presto outperforms existing state-of-the-art video generation methods on the VBench Semantic Score (78.5%) and Dynamic Degree (100%).

Low | GrooveSquid.com (original content) | Low Difficulty Summary
Presto is a new way to make videos that are 15 seconds long and look really good. The problem is that it’s hard to keep a long video interesting while making sure its scenes still fit together. To fix this, the creators came up with a special trick called Segmented Cross-Attention (SCA). It cuts the video into time segments and gives each segment its own piece of the description, so the different parts of the video fit together well. They also built a big dataset of 261k captioned videos, called LongTake-HD, to support the idea. The results show that Presto is really good at making long videos that are fun and interesting.
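To make the Segmented Cross-Attention idea from the medium-difficulty summary more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names (`cross_attend`, `segmented_cross_attention`), the single attention head, the equal-length split, and the omission of the query/key/value projections are all simplifying assumptions made for illustration. The sketch only shows the core mechanism the summary describes: hidden states are split along the temporal dimension, and each segment attends to its own sub-caption.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no query/key/value projections; a real DiT block would
    # apply the ones its cross-attention layer already has.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(d)])
    return out

def segmented_cross_attention(hidden, sub_captions):
    # hidden: one feature vector per time step (list of lists of floats).
    # sub_captions: one list of token embeddings per sub-caption.
    # Assumption: equal-length, non-overlapping temporal segments.
    n_seg = len(sub_captions)
    if len(hidden) % n_seg != 0:
        raise ValueError("time steps must divide evenly in this sketch")
    seg_len = len(hidden) // n_seg
    out = []
    for i, caption in enumerate(sub_captions):
        segment = hidden[i * seg_len:(i + 1) * seg_len]
        # Each temporal segment sees only its corresponding sub-caption.
        out.extend(cross_attend(segment, caption))
    return out

# Toy example: 4 time steps, 2 sub-captions with one token each.
hidden = [[0.5, 0.5]] * 4
sub_captions = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(segmented_cross_attention(hidden, sub_captions))
# prints [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

In the toy example, the first two time steps take on the first sub-caption's embedding and the last two take on the second's, which is the per-segment conditioning the summary describes. Because the segmentation only changes which caption tokens each query sees, it adds no parameters of its own, in line with the summary's statement that SCA requires no additional parameters.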

Keywords

  • Artificial intelligence
  • Cross attention
  • Diffusion model