Summary of FrameBridge: Improving Image-to-Video Generation with Bridge Models, by Yuji Wang et al.
FrameBridge: Improving Image-to-Video Generation with Bridge Models
by Yuji Wang, Zehua Chen, Xiaoyu Chen, Jun Zhu, Jianfei Chen
First submitted to arXiv on: 20 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper presents FrameBridge, a novel image-to-video (I2V) generation model that leverages the given static image as prior knowledge to generate video samples with both appearance consistency and temporal coherence. Unlike diffusion-based methods, which rely on a noise-to-data generation process, FrameBridge uses a data-to-data process that makes it easier to learn the animation of the input image. The authors also propose two techniques, SNR-Aligned Fine-tuning (SAF) and a neural prior, which improve the fine-tuning efficiency of pre-trained text-to-video (T2V) models and the synthesis quality of bridge-based I2V models, respectively. Experiments on WebVid-2M and UCF-101 show that FrameBridge outperforms diffusion-based methods in I2V quality, achieving a zero-shot FVD of 83 on MSR-VTT and a non-zero-shot FVD of 122 on UCF-101. The proposed techniques also improve bridge-based I2V models in both fine-tuning and training-from-scratch settings. |
Low | GrooveSquid.com (original content) | This paper is about creating videos from still images, which could be useful for things like video games or movies. The researchers developed a new method called FrameBridge, which uses the input image as a guide to create more realistic videos. They also came up with two ideas to make it work better: one helps fine-tune pre-trained models more efficiently, and the other makes the generated videos look more natural. To test their approach, they used two big video datasets and found that FrameBridge did a much better job than previous methods at creating high-quality videos. |
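To make the data-to-data idea concrete, here is a minimal, illustrative PyTorch sketch of one bridge training step, assuming a Brownian-bridge-style process between the clean video (at t = 0) and the input image replicated across frames (at t = 1). The function names (`bridge_marginal_sample`, `bridge_training_step`), the x0-prediction loss, the model signature, and the noise scale `sigma` are assumptions for illustration only; the paper's actual parameterization, noise schedule, SAF technique, and neural prior are not reproduced here.

```python
import torch
import torch.nn.functional as F

def bridge_marginal_sample(x_video, x_prior, t, sigma=1.0):
    """Sample x_t from an assumed Brownian-bridge marginal between the
    target video x_video (endpoint at t=0) and the frame-replicated image
    prior x_prior (endpoint at t=1):
        x_t = (1 - t) * x_video + t * x_prior + sqrt(t * (1 - t)) * sigma * eps
    """
    t = t.view(-1, 1, 1, 1, 1)              # broadcast over (B, C, F, H, W)
    eps = torch.randn_like(x_video)
    return (1 - t) * x_video + t * x_prior + (t * (1 - t)).sqrt() * sigma * eps

def bridge_training_step(model, x_video, image, sigma=1.0):
    """One training step: the prior is the input image replicated along the
    temporal axis, and the network regresses the clean video (x0-prediction).
    `model(x_t, t)` is a hypothetical signature for a video denoiser."""
    b, c, f, h, w = x_video.shape
    x_prior = image.unsqueeze(2).expand(-1, -1, f, -1, -1)  # (B, C, F, H, W)
    t = torch.rand(b, device=x_video.device)                # t ~ U(0, 1)
    x_t = bridge_marginal_sample(x_video, x_prior, t, sigma)
    pred = model(x_t, t)
    return F.mse_loss(pred, x_video)
```

The key contrast with diffusion is the starting point of generation: sampling begins from the frame-replicated input image rather than from pure Gaussian noise, so the model only needs to learn the animation on top of an already-informative prior.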
Keywords
» Artificial intelligence » Diffusion » Fine tuning » Zero shot