Summary of Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model, by Min Zhao et al.
Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
by Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, Jun Zhu
First submitted to arXiv on: 22 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper investigates why image-to-video diffusion models tend to generate videos with insufficient motion, tracing the problem to a phenomenon called conditional image leakage: the model leans too heavily on the conditional image instead of synthesizing motion. The authors propose two remedies: first, starting generation from an earlier time step with an initial noise distribution given by optimal analytic expressions (Analytic-Init), which bridges the training-inference gap; second, applying a time-dependent noise distribution (TimeNoise) to the conditional image during training to disrupt it and reduce the model’s dependency on it. The authors validate these strategies on various diffusion models using their collected open-domain image benchmark and the UCF101 dataset; their methods achieve higher motion scores than the baselines while maintaining image alignment and temporal consistency (see the sketch after this table). |
Low | GrooveSquid.com (original content) | The paper looks at how computers can make videos from still images. It finds that the computer programs called diffusion models are not very good at making videos with lots of movement. This is because they rely too much on what the starting image is, instead of generating their own motion. The authors come up with two ways to fix this problem: first, they start the video generation process earlier and add some noise to make it more like real life; second, they add more noise to the starting image as time goes on, so that the computer program doesn’t rely too much on what it’s given. They test these ideas using special computer programs and a bunch of images and videos. |
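The two strategies can be illustrated with a short sketch. Below is a minimal PyTorch-style rendering of the ideas described in the medium summary; the function names, the linear TimeNoise schedule, and the coefficients `alpha_m` and `sigma_m` are illustrative assumptions, not the paper’s actual implementation or hyperparameters.

```python
import torch

def analytic_init(cond_image, alpha_m, sigma_m):
    # Sketch of Analytic-Init: start sampling from an earlier time step m < T.
    # The initial latent is drawn from a Gaussian centred on the scaled
    # conditional image rather than from pure noise, bridging the gap between
    # training and inference. alpha_m and sigma_m stand in for the paper's
    # analytically derived coefficients (assumed inputs here).
    return alpha_m * cond_image + sigma_m * torch.randn_like(cond_image)

def apply_time_noise(cond_image, t, t_max, max_scale=1.0):
    # Sketch of TimeNoise: corrupt the conditional image during training with
    # noise whose scale grows with the diffusion timestep t (a linear schedule
    # is assumed purely for illustration). At high noise levels the model can
    # no longer lean on the conditional image, reducing its dependency on it.
    scale = max_scale * (t / t_max)
    return cond_image + scale * torch.randn_like(cond_image)
```

In this reading, `apply_time_noise` would perturb the conditional input during training, while `analytic_init` would replace the standard pure-noise initialization at inference time.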
Keywords
» Artificial intelligence » Alignment » Diffusion » Inference