
Summary of Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis, by Willi Menapace et al.


Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

by Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov

First submitted to arXiv on: 22 Feb 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The research community has been repurposing image generation models to produce high-quality, versatile videos. However, because video content is highly redundant, this image-centric approach compromises motion fidelity and visual quality and limits scalability. To address these challenges, the authors introduce Snap Video, a video-first model that builds on the EDM framework to account for spatially and temporally redundant pixels and to naturally support video generation. They also propose a transformer-based architecture that trains 3.31 times faster than U-Nets and runs roughly 4.5 times faster at inference, enabling efficient training of text-to-video models with billions of parameters for the first time. The resulting model achieves state-of-the-art results on several benchmarks, generates videos with higher quality, temporal consistency, and motion complexity, and outperforms recent methods in user studies. (A minimal illustrative sketch of this joint spatiotemporal transformer idea appears after the summaries below.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
Researchers have been using computer programs to create amazing images, and now they are trying to use the same kinds of programs to make videos. The problem is that videos contain a lot of repeated content from one frame to the next, which makes it hard for these image-based programs to produce high-quality, realistic results. To fix this, the scientists built a new model called Snap Video that is designed specifically for video. It uses techniques that handle the repetition in videos and trains much faster than previous models. As a result, Snap Video creates videos that look more realistic and move more naturally, which makes it more useful for things like movies and video games.
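
The medium difficulty summary above notes that Snap Video replaces the usual U-Net with a transformer that processes space and time together. The following is a minimal, hypothetical PyTorch sketch of that general idea: a video clip is split into patches, flattened into a single token sequence, and passed through a transformer block whose attention spans both frames and spatial positions. All class names, dimensions, and the patchification scheme here are illustrative assumptions and do not reproduce the paper's actual architecture.

# Minimal, hypothetical sketch (not the paper's actual architecture):
# a transformer block that attends jointly over space and time by
# flattening a video into one token sequence. Names and sizes are illustrative.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """Pre-norm transformer block with joint attention over all video tokens."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames * patches, dim) -- one sequence for the whole
        # clip, so attention can model motion across frames, not just content
        # within a single frame.
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens


if __name__ == "__main__":
    batch, frames, channels, size, patch, dim = 2, 8, 3, 32, 8, 256
    video = torch.randn(batch, frames, channels, size, size)

    # Patchify each frame, then flatten space and time into one token axis.
    patches = video.unfold(3, patch, patch).unfold(4, patch, patch)  # B,F,C,4,4,p,p
    tokens = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(
        batch, -1, channels * patch * patch
    )
    tokens = nn.Linear(channels * patch * patch, dim)(tokens)  # (B, F*16, dim)

    out = SpatioTemporalBlock(dim)(tokens)
    print(out.shape)  # torch.Size([2, 128, 256])

Because every token can attend to every other token across frames, motion is handled directly by the attention layers rather than by separate temporal modules bolted onto an image U-Net, which is one way a video-first transformer design can scale more efficiently.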

Keywords

» Artificial intelligence  » Image generation  » Inference  » Transformer