Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Accepted at NeurIPS 2024
1The Hong Kong Polytechnic University, 2Amazon, 3Southern University of Science and Technology

Abstract

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from internal biases in text encoding, which overlook motion, and from inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions.

Background

Most existing T2V models produce outputs that resemble static animations or exhibit only minimal camera movement, falling short of capturing the intricate motions described in textual inputs. This limitation arises from two primary challenges:


The first challenge is inadequate motion representation in text encoding. Current T2V models use large-scale vision-language models (VLMs), such as CLIP, as text encoders. These VLMs are highly effective at capturing static elements and spatial relationships but struggle to encode dynamic motions. This is primarily due to their training focus, which biases them towards recognizing nouns and objects, while verbs and actions are represented less accurately.

To quantify this bias, we generate a set of prompts following a fixed template and group them according to the part of speech (POS) being varied. Each group is then passed through the CLIP text encoder, and we compute sensitivity as the average sentence distance within each group. As shown above, CLIP is markedly less sensitive to the POS that convey motion than to those that convey content.
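The sketch below illustrates one way to run such a probe, assuming the open-source openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers and an illustrative template and word lists rather than our exact prompt set; here, sensitivity is taken as the mean pairwise cosine distance of the projected text embeddings within each group.

# Hypothetical probe of CLIP's POS sensitivity; the template, word lists, and
# checkpoint are illustrative placeholders, not the exact ones from our study.
import itertools
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32").eval()

TEMPLATE = "a {adj} {noun} is {verb} in the park"
groups = {
    "adjective (content)": [TEMPLATE.format(adj=a, noun="dog", verb="running")
                            for a in ["small", "large", "brown", "white"]],
    "noun (content)":      [TEMPLATE.format(adj="small", noun=n, verb="running")
                            for n in ["dog", "cat", "horse", "rabbit"]],
    "verb (motion)":       [TEMPLATE.format(adj="small", noun="dog", verb=v)
                            for v in ["running", "jumping", "rolling", "resting"]],
}

@torch.no_grad()
def embed(sentences):
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    feats = text_model(**inputs).text_embeds          # (N, d) projected text features
    return torch.nn.functional.normalize(feats, dim=-1)

for name, sentences in groups.items():
    z = embed(sentences)
    # Sensitivity = mean pairwise cosine distance within the group; a lower value
    # means CLIP barely distinguishes the sentences when only that slot changes.
    dists = [1.0 - (z[i] @ z[j]).item()
             for i, j in itertools.combinations(range(len(z)), 2)]
    print(f"{name:20s} sensitivity = {sum(dists) / len(dists):.4f}")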


The second challenge is reliance on spatial-only text conditioning. Existing models often extend Text-to-Image (T2I) generation techniques to T2V tasks, applying text information through spatial cross-attention on a frame-by-frame basis. While effective for generating high-quality static images, this approach is insufficient for videos, where motion is a critical component that spans both the spatial and temporal dimensions. A holistic approach that integrates text information across both dimensions is essential for generating videos with realistic motion dynamics.
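In schematic form, this frame-wise conditioning looks like the sketch below; module names and tensor shapes are illustrative rather than taken from a specific model. Because every frame attends to the same text tokens independently, the conditioning signal never spans the temporal axis.

# Schematic of spatial-only text conditioning: each frame attends to the text
# tokens independently, so no text information flows along the time axis.
from einops import rearrange
from torch import nn

class FrameWiseCrossAttention(nn.Module):
    def __init__(self, dim: int, text_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, video_feats, text_tokens):
        # video_feats: (B, F, C, H, W) latent features; text_tokens: (B, L, D_text)
        b, f, c, h, w = video_feats.shape
        x = rearrange(video_feats, "b f c h w -> (b f) (h w) c")
        # The same text tokens are repeated for every frame: conditioning is
        # applied per frame, purely spatially, and identically across time.
        ctx = text_tokens.repeat_interleave(f, dim=0)
        out, _ = self.attn(x, ctx, ctx)
        return rearrange(out, "(b f) (h w) c -> b f c h w", b=b, h=h, w=w)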

Methodology

Overview of DEMO. The framework incorporates dual text encoding and dual text conditioning (for simplicity, other layers in the UNet are omitted). During training, \(\textcolor{red}{\mathcal{L}_{\text{text-motion}}}\) enhances motion encoding, \(\textcolor{green}{\mathcal{L}_{\text{reg}}}\) guards against catastrophic forgetting, and \(\textcolor{yellow}{\mathcal{L}_{\text{video-motion}}}\) strengthens motion integration. Snowflakes and flames denote frozen and trainable parameters, respectively.
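One plausible way to combine these objectives is as a weighted sum added to the standard diffusion denoising loss; the weights \(\lambda_{1}\), \(\lambda_{2}\), \(\lambda_{3}\) below are illustrative placeholders rather than the paper's exact formulation:

\[
\mathcal{L} = \mathcal{L}_{\text{diffusion}} + \lambda_{1}\,\mathcal{L}_{\text{text-motion}} + \lambda_{2}\,\mathcal{L}_{\text{reg}} + \lambda_{3}\,\mathcal{L}_{\text{video-motion}}
\]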

To address these challenges, we introduce DEcomposed MOtion (DEMO), a novel framework designed to enhance motion synthesis in T2V generation. DEMO takes a comprehensive approach, decomposing both the text encoding and the text conditioning processes into content and motion components.


To address the first challenge, DEMO decomposes text encoding into separate content and motion encoding processes. Content encoding focuses on object appearance and spatial layout, while motion encoding captures object movement and temporal dynamics. This separation allows the model to better understand and represent the dynamic aspects of the described scene.
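A hedged sketch of this idea follows: a frozen pretrained CLIP text encoder supplies content embeddings, while a small trainable head re-reads the same token features to produce motion embeddings. The checkpoint, layer count, and module choices are assumptions for illustration, not DEMO's actual architecture.

# Sketch of decomposed text encoding: a frozen text encoder for content, plus a
# trainable head for motion. All sizes and module choices are illustrative.
from torch import nn
from transformers import CLIPTokenizer, CLIPTextModel

class DecomposedTextEncoder(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-base-patch32", motion_layers=2):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(clip_name)
        self.content_encoder = CLIPTextModel.from_pretrained(clip_name)
        self.content_encoder.requires_grad_(False)        # frozen: static content
        dim = self.content_encoder.config.hidden_size
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.motion_encoder = nn.TransformerEncoder(layer, motion_layers)  # trainable

    def forward(self, prompts):
        tokens = self.tokenizer(prompts, padding=True, return_tensors="pt")
        content = self.content_encoder(**tokens).last_hidden_state   # (B, L, D)
        motion = self.motion_encoder(content)                        # (B, L, D)
        return content, motion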

To address the second challenge, DEMO decomposes the text conditioning process into content and motion dimensions. The content conditioning module integrates spatial embeddings into the video generation process on a frame-by-frame basis, ensuring that static elements are depicted accurately in each frame. In contrast, the motion conditioning module operates across the temporal dimension, infusing dynamic motion embeddings into the video.
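As a counterpart to the frame-wise sketch above, the following illustrates how motion conditioning can operate along the temporal axis: features at each spatial location attend across frames to the motion embeddings. Tensor shapes and module choices are again illustrative assumptions rather than DEMO's exact design.

# Sketch of motion conditioning along time: each spatial location becomes a
# length-F temporal sequence that queries the motion embeddings.
from einops import rearrange
from torch import nn

class TemporalMotionConditioning(nn.Module):
    def __init__(self, dim: int, motion_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=motion_dim,
                                          vdim=motion_dim, batch_first=True)

    def forward(self, video_feats, motion_tokens):
        # video_feats: (B, F, C, H, W); motion_tokens: (B, L, D_motion)
        b, f, c, h, w = video_feats.shape
        # Fold space into the batch so attention runs across frames, not pixels.
        x = rearrange(video_feats, "b f c h w -> (b h w) f c")
        ctx = motion_tokens.repeat_interleave(h * w, dim=0)
        out, _ = self.attn(x, ctx, ctx)
        return rearrange(out, "(b h w) f c -> b f c h w", b=b, h=h, w=w)

In a full model, such a temporal module would complement, not replace, the frame-wise content conditioning.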

Moreover, DEMO incorporates novel text-motion and video-motion supervision techniques to enhance the model's understanding and generation of motion.

Societal Impact

This project aims to improve motion generation in text-to-video technologies, enabling more precise and expressive video content creation from text descriptions alone. By addressing the limitations in the text encoding and conditioning mechanisms found in current literature, we hope to improve the accessibility and effectiveness of video generation for creative industries, education, and entertainment. Our research also underscores the importance of fine-grained text-video alignment, which can contribute to more nuanced storytelling, enhanced educational tools, and richer visual communication.

BibTeX


@misc{ruan2024enhancingmotiontexttovideogeneration,
  title={Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning},
  author={Penghui Ruan and Pichao Wang and Divya Saxena and Jiannong Cao and Yuhui Shi},
  year={2024},
  eprint={2410.24219},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.24219},
}