
Summary of HARIVO: Harnessing Text-to-Image Models for Video Generation, by Mingi Kwon et al.


HARIVO: Harnessing Text-to-Image Models for Video Generation

by Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

First submitted to arXiv on: 10 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
The paper presents a method to create diffusion-based video models from pre-trained Text-to-Image (T2I) models. Building upon the AnimateDiff approach, which freezes the T2I model and trains only the temporal layers, this work proposes an architecture incorporating mapping networks and frame-wise tokens for video generation. The key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, which together yield realistic and temporally consistent videos despite the limited amount of public video data. The method simplifies the training process and allows seamless integration with off-the-shelf models such as ControlNet and DreamBooth, as the sketch below illustrates.
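To make the training recipe concrete, here is a minimal PyTorch sketch of the frozen-T2I setup described above: the pre-trained spatial layers keep their weights fixed while only the newly added temporal layers receive gradients. Every name here (VideoBlock, the linear stand-in for a spatial layer, the attention-based temporal layer) is an illustrative assumption, not HARIVO's actual code; the paper's mapping networks and frame-wise tokens are omitted for brevity.

```python
import torch
import torch.nn as nn

class VideoBlock(nn.Module):
    """One simplified U-Net block: a frozen pre-trained spatial (T2I)
    layer followed by a newly added, trainable temporal layer.
    Illustrative sketch only; not the authors' implementation."""

    def __init__(self, spatial_layer: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_layer  # taken from the pre-trained T2I model
        self.temporal = nn.MultiheadAttention(channels, num_heads=8,
                                              batch_first=True)
        # Freeze every pre-trained T2I parameter.
        for p in self.spatial.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels), flattened per spatial location.
        b, f, c = x.shape
        # Spatial layer sees each frame independently.
        x = self.spatial(x.reshape(b * f, c)).reshape(b, f, c)
        # Temporal attention mixes information across the frame axis.
        out, _ = self.temporal(x, x, x)
        return x + out

# A linear layer stands in for a real T2I spatial layer.
block = VideoBlock(spatial_layer=nn.Linear(320, 320), channels=320)

# Only the temporal (and any other newly added) parameters are optimized.
trainable = [p for p in block.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

x = torch.randn(2, 8, 320)  # 2 videos, 8 frames, 320 channels
y = block(x)
```

Because only the new parameters are optimized, the T2I model's image prior is preserved, which is what allows off-the-shelf tools such as ControlNet and DreamBooth to plug in unchanged.
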
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper presents a way to make videos using text-to-image models. It improves on the AnimateDiff approach, which trains only the temporal layers while keeping the rest of the model frozen. The new method uses special networks and tokens to generate the frames of a video. It also adds new ways to measure how smooth a video is over time and to make sure the video looks realistic. This means that even with limited video data, you can still make good videos.
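
The exact form of the paper's temporal-smoothness losses is not given in this summary, but the general idea can be illustrated with a simple stand-in that penalizes large changes between consecutive frames. The function below is a generic sketch, not the authors' loss.

```python
import torch

def temporal_smoothness_loss(frames: torch.Tensor) -> torch.Tensor:
    """Illustrative stand-in for a temporal smoothness objective:
    penalize the squared difference between consecutive frame latents.
    frames: (batch, num_frames, ...) tensor of predicted latents.
    HARIVO's actual losses may differ; this only conveys the idea."""
    diffs = frames[:, 1:] - frames[:, :-1]
    return diffs.pow(2).mean()

# Example: 2 videos, 8 frames, 4x32x32 latents each.
latents = torch.randn(2, 8, 4, 32, 32)
loss = temporal_smoothness_loss(latents)
```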

Keywords

» Artificial intelligence  » Diffusion