Summary of DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation, by Zun Wang et al.
DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
by Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
First submitted to arXiv on: 25 Nov 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Storytelling video generation (SVG) aims to create coherent, visually rich, multi-scene videos that follow a structured narrative. Existing methods primarily employ Large Language Models (LLMs) for high-level planning, decomposing the story into scene-level descriptions that are then generated independently and stitched together. However, these approaches struggle to produce high-quality videos for complex single-scene descriptions that require coherent composition of multiple characters and events, complex motion synthesis, and multi-character customization. To address these challenges, the authors propose DreamRunner, a novel story-to-video generation method that structures the input script with an LLM to support coarse-grained scene planning as well as fine-grained object-level layout and motion planning. DreamRunner also introduces retrieval-augmented test-time adaptation to capture target motion priors for the objects in each scene, enabling diverse motion customization based on retrieved videos. In addition, the authors propose a spatial-temporal region-based 3D attention and prior injection module (SR3AI) for fine-grained object-motion binding and frame-by-frame semantic control. Compared with various SVG baselines, DreamRunner achieves state-of-the-art character consistency, text alignment, and smooth transitions; it also exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-CompBench. (A conceptual sketch of the planning-and-adaptation pipeline follows the table.) |
Low | GrooveSquid.com (original content) | This paper is about creating videos that tell a story. Current methods are good at breaking a story into scenes and generating each scene separately, but they struggle to make the whole video look smooth and consistent. The researchers propose a new method, DreamRunner, that does this better: it first plans the story in detail and then generates the video from those plans. This lets DreamRunner create videos with complex motions, such as characters moving around and interacting with each other. When the researchers tested their method against others, it performed better, and they also showed examples of multi-object interactions, where several characters take part in the same scene. |
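
To make the two-stage planning-and-adaptation pipeline in the medium summary more concrete, here is a minimal, heavily simplified Python sketch. It only mirrors the flow described above (coarse scene planning, fine-grained object-level layout and motion planning, per-motion video retrieval, test-time adaptation, region-controlled generation). Every name in it (`plan_scenes`, `plan_objects`, `retrieve_motion_videos`, `adapt_and_generate`) is a hypothetical placeholder rather than the authors' code or API, and the LLM, retrieval, and generation steps are stubbed out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectPlan:
    """Fine-grained plan for one object in a scene (hypothetical structure)."""
    name: str            # e.g. "fox"
    layout: List[float]  # a single bounding box here; the paper plans layouts per frame
    motion: str          # e.g. "runs across the clearing"

@dataclass
class ScenePlan:
    """Coarse scene-level plan produced in the first planning stage."""
    description: str
    objects: List[ObjectPlan] = field(default_factory=list)

def plan_scenes(story: str) -> List[ScenePlan]:
    """Stage 1 (coarse): an LLM would split the story into scene descriptions.
    Faked here with a naive sentence split."""
    return [ScenePlan(description=s.strip()) for s in story.split(".") if s.strip()]

def plan_objects(scene: ScenePlan) -> ScenePlan:
    """Stage 2 (fine-grained): an LLM would produce per-object layouts and motions.
    Placeholder: one generic object covering most of the frame."""
    scene.objects.append(
        ObjectPlan(name="subject", layout=[0.2, 0.2, 0.8, 0.8], motion=scene.description)
    )
    return scene

def retrieve_motion_videos(motion: str, k: int = 3) -> List[str]:
    """Retrieve k reference clips whose captions match the target motion
    (stand-in for a real video-retrieval index)."""
    return [f"clip_{i}:{motion}" for i in range(k)]

def adapt_and_generate(scene: ScenePlan) -> str:
    """Test-time adaptation + generation: tune motion priors on retrieved clips,
    then render the scene under region-based control (both faked here)."""
    for obj in scene.objects:
        refs = retrieve_motion_videos(obj.motion)
        _ = refs  # ...adapt a motion prior on `refs`, inject it via region-based attention...
    return f"video<{scene.description}>"

def story_to_video(story: str) -> List[str]:
    scenes = [plan_objects(s) for s in plan_scenes(story)]
    return [adapt_and_generate(s) for s in scenes]

if __name__ == "__main__":
    print(story_to_video("A fox explores a forest. The fox meets an owl at night."))
```

Separating coarse scene planning from per-object layout and motion planning is what allows each motion to be matched with its own retrieved reference clips before the scene is generated.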
Keywords
» Artificial intelligence » Alignment » Attention