STIV: Scalable Text and Image Conditioned Video Generation
by Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Yifan Jiang, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang
First submitted to arXiv on: 10 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This study systematically explores the interplay of model architectures, training recipes, and data curation strategies to develop robust and scalable text- and image-conditioned video generation models. The proposed framework, STIV, integrates the image condition into a Diffusion Transformer (DiT) through frame replacement and incorporates text conditioning via joint image-text conditional classifier-free guidance (a minimal sketch of both mechanisms follows the table). This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously, with applications in video prediction, frame interpolation, multi-view generation, and long video generation. Comprehensive ablation studies demonstrate strong performance despite the simple design, surpassing leading open- and closed-source models. |
Low | GrooveSquid.com (original content) | The study presents a new way to generate videos conditioned on text or images. The method, called STIV, combines image and text information to create videos. It can be used for tasks like generating videos from text descriptions, creating videos with specific scenes or characters, and even predicting what will happen next in a video. The study shows that this method works well and can outperform other state-of-the-art methods. |
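
To make the two conditioning mechanisms concrete, here is a minimal PyTorch sketch of how frame replacement and joint image-text classifier-free guidance could fit together. Everything here is an illustrative assumption: the `denoise` stand-in, the function names, the tensor shapes, and the choice to drop both conditions in a single unconditional pass are not taken from the paper's code.

```python
import torch

def denoise(latents, text_emb):
    # Toy stand-in for the DiT noise predictor so the sketch runs
    # end to end; a real model would be called here instead.
    return latents * 0.5 + text_emb.mean()

def stiv_step(latents, image_latent, text_emb, null_text_emb, w=7.5):
    """One guided denoising step (hypothetical shapes and names).

    Frame replacement: the first latent frame is overwritten with the
    clean image-condition latent. Joint image-text classifier-free
    guidance: one conditional pass sees both conditions, one
    unconditional pass sees neither (null text, no frame replacement).
    """
    cond_latents = latents.clone()
    cond_latents[:, 0] = image_latent  # inject the image condition as frame 0

    eps_cond = denoise(cond_latents, text_emb)     # text + image
    eps_uncond = denoise(latents, null_text_emb)   # neither condition
    return eps_uncond + w * (eps_cond - eps_uncond)

# Usage with made-up shapes: batch of 1, 8 latent frames, 4 channels.
B, T, C, H, W = 1, 8, 4, 32, 32
latents = torch.randn(B, T, C, H, W)
image_latent = torch.randn(B, C, H, W)
text_emb = torch.randn(B, 77, 768)
guided = stiv_step(latents, image_latent, text_emb, torch.zeros_like(text_emb))
print(guided.shape)  # torch.Size([1, 8, 4, 32, 32])
```

As the summary describes it, the appeal of the joint guidance is that a single model with a single unconditional branch covers both T2V (no image frame injected) and TI2V (image frame injected), rather than maintaining a separate guidance term per condition.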
Keywords
» Artificial intelligence » Diffusion » Transformer