DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation

by Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal

First submitted to arXiv on: 25 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract, available on the arXiv page.

Medium Difficulty Summary (original content by GrooveSquid.com)
Storytelling video generation (SVG) aims to create coherent and visually rich multi-scene videos following a structured narrative. Existing methods primarily employ Large Language Models (LLMs) for high-level planning, decomposing the story into scene-level descriptions, which are then independently generated and stitched together. However, these approaches struggle with generating high-quality videos aligned with complex single-scene descriptions, involving coherent composition of multiple characters and events, complex motion synthesis, and multi-character customization. To address these challenges, we propose DreamRunner, a novel story-to-video generation method that structures the input script using an LLM to facilitate coarse-grained scene planning and fine-grained object-level layout and motion planning. DreamRunner also presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos. Additionally, we propose a novel spatial-temporal region-based 3D attention and prior injection module (SR3AI) for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Furthermore, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench.
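The pipeline described above (LLM-based scene planning, fine-grained layout and motion planning, retrieval-augmented test-time motion adaptation, and region-based generation) can be pictured roughly as follows. This is a minimal, hypothetical Python sketch for intuition only; the class and method names (plan_scenes, plan_layouts, plan_motions, search, adapt_motion_prior, generate) are illustrative placeholders, not the authors' code or API.

```python
# Hypothetical sketch of a DreamRunner-style story-to-video pipeline.
# All object interfaces here are assumed placeholders, not the real implementation.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ScenePlan:
    description: str                                      # coarse-grained scene description from the LLM
    layouts: List[dict] = field(default_factory=list)     # object-level region layouts per frame
    motions: List[str] = field(default_factory=list)      # fine-grained motion description per object


def generate_story_video(script: str, llm: Any, retriever: Any, video_model: Any) -> List[Any]:
    """Plan with an LLM, adapt motion priors from retrieved videos at test time,
    then generate each scene with region-based spatio-temporal control."""
    # 1) Coarse-grained planning: decompose the script into scene-level descriptions.
    scene_descriptions = llm.plan_scenes(script)

    videos = []
    for desc in scene_descriptions:
        # 2) Fine-grained planning: object-level layout and motion for this scene.
        plan = ScenePlan(
            description=desc,
            layouts=llm.plan_layouts(desc),
            motions=llm.plan_motions(desc),
        )

        # 3) Retrieval-augmented test-time adaptation: retrieve reference clips for
        #    each planned motion and adapt the video model to that motion prior.
        for motion in plan.motions:
            reference_clips = retriever.search(motion, top_k=4)
            video_model.adapt_motion_prior(motion, reference_clips)

        # 4) Generation with region-based 3D attention and prior injection
        #    (SR3AI-style), binding each object's motion to its planned region.
        videos.append(video_model.generate(plan))

    return videos
```

The point the sketch tries to convey is that motion priors are adapted per scene at test time from retrieved reference videos, rather than being fixed in the model, while the planned object regions control where each motion appears frame by frame.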
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about creating videos that tell a story. Current methods are good at breaking down the story into scenes and then making each scene separately, but they struggle to make the whole video look smooth and consistent. The researchers propose a new method called DreamRunner that can do this better. It works by planning out the story in detail and then generating the video based on those plans. This helps DreamRunner create videos with complex motions, like characters moving around and interacting with each other. The researchers tested their method against others and found it performed better. They also showed examples of multi-object interactions, where multiple characters are involved in a scene.

Keywords

  • Artificial intelligence
  • Alignment
  • Attention