Summary of HARIVO: Harnessing Text-to-Image Models for Video Generation, by Mingi Kwon et al.
HARIVO: Harnessing Text-to-Image Models for Video Generation
by Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh
First submitted to arXiv on: 10 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty: the medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper presents a method for building diffusion-based video models from pretrained Text-to-Image (T2I) models. Following the AnimateDiff recipe of freezing the T2I backbone and training only temporal layers, this work proposes an architecture that adds mapping networks and frame-wise tokens for video generation. Its key innovations are novel loss functions for temporal smoothness and a mitigating gradient sampling technique, which together yield realistic, temporally consistent videos despite the limited amount of public video data. The method simplifies training and integrates seamlessly with off-the-shelf models such as ControlNet and DreamBooth. A minimal sketch of the frozen-backbone training pattern follows the table. |
| Low | GrooveSquid.com (original content) | This paper develops a way to make videos from text-to-image models. It builds on the AnimateDiff approach, which trains only temporal layers while keeping the rest of the model frozen. The new method adds special networks and tokens to generate the frames of a video, plus new ways to keep the video smooth over time and looking realistic. This means that even with limited video data, you can still make good videos. |
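To make the frozen-backbone recipe concrete, below is a minimal PyTorch sketch of the training pattern the medium summary describes: a pretrained per-frame (T2I-style) module is frozen, only the newly added temporal layers receive gradients, and a smoothness penalty discourages abrupt frame-to-frame changes. The class and function names (`TemporalAttention`, `VideoModel`, `temporal_smoothness_loss`) are illustrative assumptions, not the authors' implementation, and the smoothness term is a generic stand-in for the paper's actual losses.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal layer: self-attention across the frame axis."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim); attention mixes information across frames.
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection

class VideoModel(nn.Module):
    """Frozen per-frame (T2I-style) backbone plus trainable temporal layers."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)      # stand-in for a frozen T2I block
        self.temporal = TemporalAttention(dim)  # the only part we train

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.spatial(x)  # applied independently to each frame
        return self.temporal(x)

def temporal_smoothness_loss(frames: torch.Tensor) -> torch.Tensor:
    """Penalize large changes between consecutive frame features.
    A generic smoothness term, not the paper's exact loss."""
    return (frames[:, 1:] - frames[:, :-1]).pow(2).mean()

model = VideoModel()
# Freeze the pretrained spatial (T2I) weights; train only the temporal layers.
for p in model.spatial.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

x = torch.randn(2, 8, 64)       # (batch, frames, feature dim)
target = torch.randn(2, 8, 64)  # placeholder regression target

opt.zero_grad()
pred = model(x)
loss = nn.functional.mse_loss(pred, target) + 0.1 * temporal_smoothness_loss(pred)
loss.backward()
opt.step()
```

Freezing the backbone keeps the image prior intact, which is why this family of methods works even with limited video data: only the comparatively small temporal modules need to be learned.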
Keywords
» Artificial intelligence » Diffusion