STIV: Scalable Text and Image Conditioned Video Generation
by Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Yifan Jiang, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang
First submitted to arXiv on: 10 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This study systematically explores the interplay of model architectures, training recipes, and data curation strategies to develop robust and scalable text- and image-conditioned video generation models. The proposed framework, STIV, integrates the image condition into a Diffusion Transformer (DiT) through frame replacement and incorporates text conditioning via joint image-text conditional classifier-free guidance (a minimal sketch of both mechanisms follows the table). This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously, with applications in video prediction, frame interpolation, multi-view generation, and long video generation. Comprehensive ablation studies demonstrate strong performance despite the simple design, surpassing leading open- and closed-source models. |
Low | GrooveSquid.com (original content) | The study presents a new way to generate videos conditioned on text or images. The method, called STIV, combines image and text information to create videos. It can be used for tasks like generating videos from text descriptions, creating videos with specific scenes or characters, and even predicting what will happen next in a video. The study shows that this method works well and can outperform other state-of-the-art methods. |
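
To make the two conditioning mechanisms concrete, here is a minimal PyTorch sketch of how frame replacement and joint image-text classifier-free guidance could fit together. Everything here is an illustrative assumption: the `denoise` stand-in, the function names, the tensor shapes, and the choice to drop both conditions in a single unconditional pass are not taken from the paper's code.

```python
import torch

def denoise(latents, text_emb):
    # Toy stand-in for the DiT noise predictor so the sketch runs
    # end to end; a real model would be called here instead.
    return latents * 0.5 + text_emb.mean()

def stiv_step(latents, image_latent, text_emb, null_text_emb, w=7.5):
    """One guided denoising step (hypothetical shapes and names).

    Frame replacement: the first latent frame is overwritten with the
    clean image-condition latent. Joint image-text classifier-free
    guidance: one conditional pass sees both conditions, one
    unconditional pass sees neither (null text, no frame replacement).
    """
    cond_latents = latents.clone()
    cond_latents[:, 0] = image_latent  # inject the image condition as frame 0

    eps_cond = denoise(cond_latents, text_emb)     # text + image
    eps_uncond = denoise(latents, null_text_emb)   # neither condition
    return eps_uncond + w * (eps_cond - eps_uncond)

# Usage with made-up shapes: batch of 1, 8 latent frames, 4 channels.
B, T, C, H, W = 1, 8, 4, 32, 32
latents = torch.randn(B, T, C, H, W)
image_latent = torch.randn(B, C, H, W)
text_emb = torch.randn(B, 77, 768)
guided = stiv_step(latents, image_latent, text_emb, torch.zeros_like(text_emb))
print(guided.shape)  # torch.Size([1, 8, 4, 32, 32])
```

As the summary describes it, the appeal of the joint guidance is that a single model with a single unconditional branch covers both T2V (no image frame injected) and TI2V (image frame injected), rather than maintaining a separate guidance term per condition.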
Keywords
» Artificial intelligence » Diffusion » Transformer