Research Paper

Single‑Shot 4D Mesh Reconstruction from Monocular Video

Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi
Published: 2026-01-08 • Added: 2026-01-09

Key Insights

  • A compact spatio‑temporal latent space encodes an entire animation sequence in one forward pass, enabling “one‑shot” reconstruction of 3D shape and motion.
  • The latent space is learned with a skeleton‑guided autoencoder, providing strong deformation priors during training while requiring no skeletal input at test time.
  • A latent diffusion model conditioned on the input video and the first‑frame mesh predicts the full deformation field, yielding more accurate geometry and realistic novel‑view synthesis than prior methods.
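The skeleton-guided prior in the second point is in the spirit of classical linear blend skinning, where each vertex moves as a weighted blend of bone transforms. The paper's exact formulation is not given here; the following numpy sketch only illustrates the general idea, and all names and shapes are assumptions:

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Deform rest-pose vertices by a weighted blend of bone transforms.

    vertices:        (V, 3) rest-pose positions
    weights:         (V, B) skinning weights, each row sums to 1
    bone_transforms: (B, 4, 4) homogeneous bone transforms
    returns:         (V, 3) deformed positions
    """
    V = vertices.shape[0]
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)  # (V, 4)
    # Per-bone transformed positions: per_bone[b, v] = T_b @ homo[v]
    per_bone = np.einsum("bij,vj->bvi", bone_transforms, homo)   # (B, V, 4)
    # Blend across bones with the skinning weights
    blended = np.einsum("vb,bvi->vi", weights, per_bone)         # (V, 4)
    return blended[:, :3]

# Toy example: two vertices, two bones (identity and a +1 x-translation)
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0], [0.5, 0.5]])
T = np.stack([np.eye(4), np.eye(4)])
T[1, 0, 3] = 1.0  # second bone translates +1 along x
deformed = linear_blend_skinning(verts, w, T)
# vertex 1 averages the identity and translated poses -> (1.5, 0, 0)
```

A deformation field constrained this way only produces motions explainable by a small set of bone transforms, which is the kind of plausibility prior the autoencoder inherits during training.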

Abstract

We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.
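The "spatio-temporal attention" in the encoder means tokens attend jointly across frames and spatial positions, rather than within each frame separately, which is what lets a single pass summarize the whole sequence. A minimal single-head numpy sketch, with all dimensions and names chosen for illustration (the paper's architecture details are not specified here):

```python
import numpy as np

def spatio_temporal_attention(tokens, Wq, Wk, Wv):
    """Single-head attention over all (time, space) tokens jointly.

    tokens:       (T, N, D) -- T frames, N spatial tokens per frame, D channels
    Wq, Wk, Wv:   (D, D) projection matrices
    returns:      (T, N, D) attended features
    """
    T, N, D = tokens.shape
    x = tokens.reshape(T * N, D)                 # flatten time and space together
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)                # (T*N, T*N): every token sees every frame
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over all tokens
    return (attn @ v).reshape(T, N, D)

rng = np.random.default_rng(0)
T, N, D = 4, 8, 16
x = rng.standard_normal((T, N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = spatio_temporal_attention(x, Wq, Wk, Wv)
```

Because the attention matrix spans all T*N tokens, each output feature is informed by the entire clip, which is why the abstract can claim a "more stable representation of the object's overall deformation" than frame-by-frame encoding.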

Source

arXiv: https://arxiv.org/abs/2601.05251
Topics: computer-vision, ai-ml • Difficulty: advanced
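The one-shot generation step follows the standard latent-diffusion recipe: starting from noise, a denoiser conditioned on the input video (and first-frame mesh) features iteratively refines the whole animation latent, which is then decoded into the deformation field. A toy DDPM-style sampling loop in numpy; the schedule, conditioning scheme, and denoiser below are placeholders, not the paper's:

```python
import numpy as np

def sample_animation_latent(denoise_fn, cond, shape, steps=50, seed=0):
    """DDPM-style ancestral sampling of a full-sequence latent.

    denoise_fn(z, cond, t) -> predicted noise, same shape as z
    cond:  conditioning features (e.g. video + first-frame mesh embedding)
    shape: shape of the spatio-temporal latent to generate
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    z = rng.standard_normal(shape)            # start from pure noise
    for t in reversed(range(steps)):
        eps = denoise_fn(z, cond, t)          # predict the noise component
        # DDPM reverse-step mean update
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                             # add noise on all but the last step
            z += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return z

# Placeholder denoiser: nudges the latent toward the conditioning features
dummy_denoiser = lambda z, cond, t: z - cond
latent = sample_animation_latent(dummy_denoiser, cond=np.ones((4, 8)), shape=(4, 8))
```

The key property claimed for Mesh4D is that `shape` covers the entire sequence at once, so a single sampling run yields the full animation rather than one frame at a time.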