STIV: Scalable Text and Image Conditioned Video Generation

by Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Yifan Jiang, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang

First submitted to arXiv on: 10 Dec 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com)
This study systematically explores the interplay of model architectures, training recipes, and data curation strategies to build robust and scalable text- and image-conditioned video generation models. The proposed framework, STIV, integrates the image condition into a Diffusion Transformer (DiT) through frame replacement and incorporates the text condition via joint image-text classifier-free guidance. This design enables a single STIV model to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks, with applications in video prediction, frame interpolation, multi-view generation, and long video generation. Comprehensive ablation studies show strong performance despite the simple design, surpassing leading open- and closed-source models. A minimal code sketch of the frame-replacement and joint-guidance mechanisms appears after the summaries below.

Low Difficulty Summary (written by GrooveSquid.com)
The study presents a new way to generate videos conditioned on text, images, or both. The method, called STIV, combines image and text information to create videos. It can be used for tasks like generating a video from a text description, creating a video that continues from a given image, and even predicting what will happen next in a video. The study shows that the method works well and can outperform other state-of-the-art methods.
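
To make the medium summary's two key mechanisms concrete, here is a minimal PyTorch sketch of frame replacement and joint image-text classifier-free guidance. It follows the standard classifier-free guidance recipe applied jointly to both conditions; the function names, the tensor layout (batch, time, channels, height, width), and the single guidance scale are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: names like frame_replace, jit_cfg_noise,
# model, and text_emb are assumptions, not the authors' actual API.
import torch

def frame_replace(noisy_latents, image_latent):
    """Inject the image condition by overwriting the first latent frame
    of the noisy video with the clean image latent.
    noisy_latents: (B, T, C, H, W); image_latent: (B, C, H, W)."""
    out = noisy_latents.clone()
    out[:, 0] = image_latent
    return out

@torch.no_grad()
def jit_cfg_noise(model, z_t, t, text_emb, null_text_emb, image_latent, scale):
    """Noise estimate under joint image-text classifier-free guidance:
    one guidance scale moves from the fully unconditional prediction
    (no text, no image) toward the jointly conditioned one."""
    # Conditioned branch: image via frame replacement, text via its embedding.
    eps_cond = model(frame_replace(z_t, image_latent), t, text_emb)
    # Unconditioned branch: null text embedding, no frame replacement.
    eps_uncond = model(z_t, t, null_text_emb)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Because the unconditional branch drops both conditions at once, training with random condition dropout would let the same model serve T2V (image dropped) and TI2V (image kept); separate guidance scales for text and image are also possible under the standard classifier-free guidance recipe, at the cost of an extra forward pass per step.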

Keywords

  • Artificial intelligence
  • Diffusion
  • Transformer