Summary of xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations, by Can Qin et al.
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
by Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong
First submitted to arXiv on: 22 Aug 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper introduces xGen-VideoSyn-1, a text-to-video (T2V) generation model that produces realistic scenes from textual descriptions. Building on recent advances such as OpenAI’s Sora, the authors adopt the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). The VidVAE compresses video data both spatially and temporally, reducing the length of the visual token sequence and the computational demands of generating long videos. To further cut computational cost, the authors propose a divide-and-merge strategy that processes videos in segments while maintaining temporal consistency across them. The Diffusion Transformer (DiT) incorporates spatial and temporal self-attention layers to enable robust generalization (a toy sketch of this factorized attention follows the table). The authors also built a data-processing pipeline from scratch and collected over 13 million high-quality video-text pairs. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. The resulting model supports end-to-end generation of 720p videos longer than 14 seconds and performs competitively against state-of-the-art T2V models. |
Low | GrooveSquid.com (original content) | This paper is about creating realistic videos from text descriptions using a special kind of AI called xGen-VideoSyn-1. It’s like having a super-powerful camera that can turn words into moving pictures! The researchers used some fancy math and computer programming to make this happen. They also collected a huge amount of data – over 13 million video-text pairs – to help their AI learn how to generate videos. This technology could be really useful for things like creating movies, TV shows, or even virtual reality experiences. |
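To make the architecture described in the medium summary more concrete, here is a minimal sketch of a DiT-style block that alternates spatial self-attention (among patch tokens within a frame) with temporal self-attention (across frames at each token position). This is not the paper’s implementation: the class name, the tensor layout `(batch, frames, tokens_per_frame, dim)`, and the hyperparameters are illustrative assumptions, and pieces such as timestep conditioning and cross-attention on text embeddings are omitted.

```python
# Minimal sketch (not the authors' code) of a DiT-style block with
# factorized spatial and temporal self-attention over video latent tokens.
import torch
import torch.nn as nn


class SpatioTemporalDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim) latent patch tokens.
        b, t, n, d = x.shape

        # Spatial self-attention: tokens attend within their own frame.
        xs = x.reshape(b * t, n, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t, n, d)

        # Temporal self-attention: each token position attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)

        # Position-wise feed-forward network.
        return x + self.mlp(self.norm_m(x))


if __name__ == "__main__":
    block = SpatioTemporalDiTBlock(dim=64, num_heads=4)
    latent = torch.randn(2, 8, 16, 64)  # 2 videos, 8 frames, 16 tokens/frame
    print(block(latent).shape)          # torch.Size([2, 8, 16, 64])
```

Factorizing attention this way scales roughly as O(T·N² + N·T²) rather than the O((T·N)²) of full spatiotemporal attention, which is the usual motivation for such blocks when generating long videos, even after the VidVAE has compressed the latent sequence.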
Keywords
* Artificial intelligence
* Diffusion
* Diffusion model
* Generalization
* Self attention
* Transformer
* Variational autoencoder