
4D Geometric Control for Realistic Video World Modeling

Authors: Sixiao Zheng,
Organization: Hugging Face
Published: 2026-01-09 • Added: 2026-01-09

Key Insights

  • Introduces a unified 4D representation (static background point cloud + per‑object 3D Gaussian trajectories) that captures both camera motion and object dynamics in space‑time.
  • Leverages this representation as conditioning for a pretrained video diffusion model, yielding view‑consistent, high‑fidelity videos that strictly follow specified 4D motions.
  • Provides an automatic pipeline to extract the 4D controls from wild video footage, enabling training on large‑scale, unannotated datasets despite the scarcity of explicit 4D labels.
  • Uses probabilistic 3D Gaussian trajectories instead of rigid boxes, offering a category‑agnostic, flexible way to model object occupancy and motion over time.
  • Demonstrates that explicit 4D control can be seamlessly integrated with existing diffusion‑based video generators without retraining the diffusion backbone.
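
The unified 4D representation described above — a static background point cloud plus per-object probabilistic trajectories — can be sketched as a simple data structure. This is a minimal illustration, not the paper's actual code; all field and class names here are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianTrajectory:
    # One object's probabilistic 3D occupancy over T frames:
    # per-frame Gaussian mean (T, 3) and covariance (T, 3, 3).
    means: np.ndarray
    covariances: np.ndarray

@dataclass
class WorldState4D:
    # Unified 4D control state (names illustrative): a static
    # background point cloud plus per-object Gaussian trajectories.
    background_points: np.ndarray           # (N, 3) static point cloud
    objects: dict                           # object id -> GaussianTrajectory

    def num_frames(self) -> int:
        # All trajectories share the same temporal length T.
        traj = next(iter(self.objects.values()))
        return traj.means.shape[0]

# Hypothetical example: one object tracked over 8 frames.
ws = WorldState4D(
    background_points=np.zeros((100, 3)),
    objects={"car": GaussianTrajectory(
        means=np.zeros((8, 3)),
        covariances=np.tile(np.eye(3), (8, 1, 1)),
    )},
)
```

Storing a full covariance per frame (rather than a box size) is what makes the occupancy probabilistic and category-agnostic: anisotropic, rotating objects are handled without class-specific templates.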

Abstract

Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently represent dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. A further major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
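The abstract's step of rendering 4D controls into conditioning signals can be illustrated with a toy example: project a Gaussian's 3D mean through a pinhole camera and splat a soft 2D occupancy map. This is a simplified sketch under assumed conventions (the paper's actual conditioning renderer is not specified here, and a real renderer would also project the 3D covariance rather than use an isotropic splat).

```python
import numpy as np

def project_point(K, p_cam):
    # Pinhole projection of a 3D point given in camera coordinates.
    x = K @ p_cam
    return x[:2] / x[2]

def splat_gaussian(K, mean_cam, sigma_px, H, W):
    # Render a soft 2D occupancy map for one 3D Gaussian by projecting
    # its mean and splatting an isotropic 2D Gaussian around it
    # (illustrative stand-in for a full covariance-aware renderer).
    u, v = project_point(K, mean_cam)
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (xs - u) ** 2 + (ys - v) ** 2
    return np.exp(-0.5 * d2 / sigma_px ** 2)

# Hypothetical intrinsics: focal length 100 px, principal point (32, 32).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0,   0.0,  1.0]])
heat = splat_gaussian(K, np.array([0.0, 0.0, 2.0]), sigma_px=3.0, H=64, W=64)
```

Repeating this per object and per frame (plus rasterizing the static background point cloud) yields per-frame conditioning maps of the kind a video diffusion model can ingest alongside the noisy latents.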

Full Analysis

**Source:** [HuggingFace](https://huggingface.co/papers/2601.05138) | [arXiv](https://arxiv.org/abs/2601.05138)

*Topics: computer-vision, multimodal*
*Difficulty: advanced*
*Upvotes: 11*