
Summary of Movie Gen: A Cast of Media Foundation Models, by Adam Polyak et al.


Movie Gen: A Cast of Media Foundation Models

by Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, Yuming Du

First submitted to arXiv on: 17 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors): see the paper's original abstract on arXiv.

Medium Difficulty Summary (GrooveSquid.com original content)
We present Movie Gen, a suite of foundation models that generate high-quality, 1080p HD videos with varying aspect ratios and synchronized audio. The models demonstrate capabilities such as precise instruction-based video editing and personalized video generation based on user images. Our models achieve state-of-the-art performance on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. The largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, equivalent to generating a 16-second video at 16 frames-per-second. We introduce innovations in architecture, latent spaces, training objectives, data curation, evaluation protocols, parallelization techniques, and inference optimizations, enabling the benefits of scaling pre-training data, model size, and training compute for large-scale media generation models. Our work aims to accelerate progress and innovation in media generation models.
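The summary's context-length figure can be sanity-checked with simple arithmetic: 16 seconds at 16 frames per second is 256 frames, so a 73K-token context works out to roughly 285 video tokens per frame. A minimal sketch (the per-frame token count is an illustrative derivation, not a number stated in the summary):

```python
# Back-of-the-envelope check of the figures quoted above:
# a 73K video-token context for a 16-second clip at 16 fps.

MAX_CONTEXT_TOKENS = 73_000  # maximum video-token context length
CLIP_SECONDS = 16            # generated clip duration
FPS = 16                     # frames per second

frames = CLIP_SECONDS * FPS                     # total frames in the clip
tokens_per_frame = MAX_CONTEXT_TOKENS / frames  # implied latent tokens per frame

print(frames)                   # 256
print(round(tokens_per_frame))  # 285
```

This only illustrates how the quoted numbers relate; the actual latent-space tokenization in Movie Gen is described in the full paper.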
Low Difficulty Summary (GrooveSquid.com original content)
We’ve created a new way to make high-quality videos with different sizes and sounds that match each other. This technology can also edit videos based on instructions and create personalized videos from users’ images. Our system is so good that it sets a new standard for making videos, editing videos, and turning text into audio or video. The biggest model we’ve created has 30 billion parameters and can generate a 16-second video at 16 frames per second. We’ve also developed some new ideas to make this technology work better, like how we train the models and what kind of data we use. Our goal is to help others improve their own video-making technology.

Keywords

» Artificial intelligence  » Context length  » Inference  » Transformer