# 4D Geometric Control for Realistic Video World Modeling
**Authors:** Sixiao Zheng,
**Source:** [HuggingFace](https://huggingface.co/papers/2601.05138) | [arXiv](https://arxiv.org/abs/2601.05138)
**Published:** 2026-01-09
**Organization:** Hugging Face
## Summary
- Introduces a unified 4D representation (static background point cloud + per‑object 3D Gaussian trajectories) that captures both camera motion and object dynamics in space‑time (see the code sketch after this list).
- Leverages this representation as conditioning for a pretrained video diffusion model, yielding view‑consistent, high‑fidelity videos that strictly follow specified 4D motions.
- Provides an automatic pipeline to extract the 4D controls from in‑the‑wild video footage, enabling training on large‑scale, unannotated datasets despite the scarcity of explicit 4D labels.
- Uses probabilistic 3D Gaussian trajectories instead of rigid boxes, offering a category‑agnostic, flexible way to model object occupancy and motion over time.
- Demonstrates that explicit 4D control can be seamlessly integrated with existing diffusion‑based video generators without retraining the diffusion backbone.
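Below is a minimal sketch of what this unified representation might look like as a data structure. It is written in Python with NumPy; all names, shapes, and the `GaussianState`/`GeometricControl4D` classes are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of the 4D Geometric Control representation:
# a static background point cloud plus, for each object, one 3D
# Gaussian (mean + covariance) per frame. Names/shapes are assumed.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianState:
    mean: np.ndarray   # (3,) object center in world coordinates
    cov: np.ndarray    # (3, 3) covariance: the object's soft 3D extent

@dataclass
class GeometricControl4D:
    background: np.ndarray                        # (N, 3) static scene points
    trajectories: dict[str, list[GaussianState]]  # object id -> per-frame Gaussians
    camera_poses: list[np.ndarray]                # per-frame 4x4 world-to-camera

# Example: one object drifting along +x over 16 frames under a fixed camera.
frames = 16
traj = [GaussianState(mean=np.array([0.1 * t, 0.0, 5.0]),
                      cov=np.diag([0.3, 0.5, 0.3]) ** 2)
        for t in range(frames)]
control = GeometricControl4D(
    background=np.random.rand(10_000, 3),   # placeholder scene geometry
    trajectories={"obj_0": traj},
    camera_poses=[np.eye(4) for _ in range(frames)],
)
```

Because camera poses and object Gaussians live in the same world frame, moving the camera and moving an object are edits to the same structure, which is what makes the control unified.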
## Abstract
Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently represent dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. A further challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
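To make the abstract's "probabilistic 3D occupancy" concrete, the sketch below evaluates how strongly a query point is occupied by an object at a given frame: occupancy falls off with the squared Mahalanobis distance to that frame's Gaussian. This is an assumed formulation for illustration; the paper may normalize, threshold, or splat these densities differently when rendering conditioning signals.

```python
# Assumed illustration of Gaussian occupancy: density of a query point
# under an object's frame-t Gaussian N(mean, cov). Not the paper's code.
import numpy as np

def gaussian_occupancy(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    """Unnormalized occupancy of point x (3,) under N(mean, cov)."""
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return float(np.exp(-0.5 * maha))         # 1 at the center, -> 0 far away

mean_t = np.array([0.5, 0.0, 5.0])            # object center at frame t
cov_t = np.diag([0.3, 0.5, 0.3]) ** 2         # anisotropic soft extent

print(gaussian_occupancy(np.array([0.5, 0.0, 5.0]), mean_t, cov_t))  # ~1.0
print(gaussian_occupancy(np.array([2.0, 0.0, 5.0]), mean_t, cov_t))  # ~0.0
```

Projecting such a Gaussian through the camera and splatting its density into an image yields a per-frame 2D conditioning map, which is the category-agnostic alternative to drawing a rigid bounding box.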
---
*Topics: computer-vision, multimodal*
*Difficulty: advanced*