Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

by Tao Yang, Yangming Shi, Yunwen Huang, Feng Chen, Yin Zheng, Lei Zhang

First submitted to arXiv on: 19 Aug 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
A new approach to text-to-video (T2V) generation is presented, showing that limited, publicly available, low-quality data can suffice to train a high-quality video generator. The proposed method, Factorized-Dreamer, factorizes the T2V process into two steps: generating an image conditioned on a caption, then synthesizing the video from that image and the motion details in the text. It incorporates an adapter to combine text and image embeddings, pixel-aware cross-attention modules to capture pixel-level image information, and a PredictNet that supervises training with optical flow. The model can be trained directly on limited datasets with noisy captions, alleviating the need for large-scale, high-quality video-text pairs.
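
To make the factorized design concrete, here is a minimal PyTorch sketch of the three components named above. The class names, tensor shapes, and fusion choices are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

# NOTE: a minimal sketch of the factorized design described above; names,
# shapes, and fusion choices are illustrative assumptions, not the paper's code.

class TextImageAdapter(nn.Module):
    # Combines text and image embeddings into one conditioning sequence.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_emb, image_emb):
        # Both inputs: (batch, tokens, dim) -> fused: (batch, tokens, dim)
        return self.proj(torch.cat([text_emb, image_emb], dim=-1))

class PixelAwareCrossAttention(nn.Module):
    # Lets video tokens attend to pixel-level tokens of the generated image.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, pixel_tokens):
        out, _ = self.attn(video_tokens, pixel_tokens, pixel_tokens)
        return video_tokens + out  # residual connection

class PredictNet(nn.Module):
    # Auxiliary head predicting per-token optical flow (dx, dy) for supervision.
    def __init__(self, dim):
        super().__init__()
        self.head = nn.Linear(dim, 2)

    def forward(self, video_tokens):
        return self.head(video_tokens)

# Toy forward pass with random features standing in for real encoder outputs.
dim, batch = 64, 2
text_emb = torch.randn(batch, 16, dim)       # caption tokens
image_emb = torch.randn(batch, 16, dim)      # embedding of the generated image
pixel_tokens = torch.randn(batch, 256, dim)  # pixel-level image features
video_tokens = torch.randn(batch, 128, dim)  # spatio-temporal video latents

cond = TextImageAdapter(dim)(text_emb, image_emb)
video_tokens = PixelAwareCrossAttention(dim)(video_tokens, pixel_tokens)
flow_pred = PredictNet(dim)(video_tokens)  # compared to reference flow in training
print(cond.shape, video_tokens.shape, flow_pred.shape)

In the full two-step pipeline, a text-to-image model would first produce the keyframe that supplies image_emb and pixel_tokens here, and the video model would be trained with both its usual generation loss and the optical-flow supervision from PredictNet.
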
Low Difficulty Summary (written by GrooveSquid.com, original content)
This research focuses on creating high-quality videos from text descriptions. It’s like a magic trick where you input words and get a video that matches what you wrote! To make this happen, scientists broke the process into two parts: first they generate an image from the text, then they create a video that follows the movements the text describes. They designed a special machine learning model called Factorized-Dreamer to handle this task. The model can even work with limited data and noisy descriptions, making it more accessible for people who want to try it out.

Keywords

  • Artificial intelligence
  • Cross attention
  • Machine learning
  • Optical flow