Paper
- Introduces a camera‑guided retrieval module that pulls relevant latent frames from a pre‑built spatio‑temporal memory, ensuring consistent geometry across different viewpoints.
- Employs progressive training (stage‑wise spatial then temporal finetuning) to stabilize GAN learning and significantly boost temporal coherence without sacrificing spatial detail.
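The camera-guided retrieval step described above can be sketched as a nearest-pose lookup; the Euclidean pose distance, the dictionary memory layout, and all names here are illustrative assumptions, not the paper's actual design:

```python
import math

def retrieve_latents(query_pose, memory, k=2):
    """Rank memory entries by camera-pose distance and return the k
    nearest latent frames. (Hypothetical scoring: the paper's actual
    similarity metric and memory structure are not specified here.)"""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(memory, key=lambda entry: dist(entry["pose"], query_pose))
    return [entry["latent"] for entry in ranked[:k]]

# Toy memory: three stored latent frames tagged with camera positions.
memory = [
    {"pose": (0.0, 0.0, 1.0), "latent": "frame_a"},
    {"pose": (0.0, 0.0, 5.0), "latent": "frame_b"},
    {"pose": (1.0, 0.0, 1.0), "latent": "frame_c"},
]
print(retrieve_latents((0.0, 0.0, 0.9), memory, k=2))  # two closest views
```

Conditioning generation on the retrieved latents (rather than regenerating each view from scratch) is what keeps geometry consistent across viewpoints.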
Paper
- Introduces **Pixel‑Perfect Depth (PPD)**, a monocular depth model that operates directly in pixel space using diffusion transformers, eliminating flying pixels and preserving fine scene details.
- **Semantics‑Prompted DiT** injects high‑level semantic embeddings from large vision foundation models into the diffusion process, guiding global structure while still allowing the model to recover sharp local geometry.
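The injection idea can be sketched as a simple fusion inside a transformer block: a pooled semantic embedding is added to every token's features so high-level structure steers denoising. The additive fusion, the `alpha` weight, and all names are assumptions; the paper's actual mechanism may differ:

```python
def semantics_prompted_block(features, semantic_emb, alpha=0.5):
    """Sketch of semantic prompting: broadcast-add a projected
    foundation-model embedding onto the DiT's per-token features so a
    global semantic signal guides the diffusion step. (Additive fusion
    and alpha are illustrative assumptions, not the paper's design.)"""
    return [[f + alpha * s for f, s in zip(tok, semantic_emb)]
            for tok in features]

features = [[1.0, 2.0], [3.0, 4.0]]   # per-token diffusion features
semantic_emb = [0.2, -0.2]            # pooled semantic embedding
print(semantics_prompted_block(features, semantic_emb))
```

The key property the summary highlights is that the semantic signal constrains global structure while leaving the per-token features free to encode sharp local geometry.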
Paper
- Introduces a “reason‑when‑necessary” policy that triggers deep reasoning only for ambiguous video frames, reducing unnecessary computation.
- Proposes a “Thinking Once, Answering Twice” paradigm where the model generates an intermediate reasoning trace before producing two complementary answers, improving answer consistency.
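The "reason-when-necessary" gate can be sketched as an entropy threshold on the model's per-frame prediction distribution; the threshold value and function names are illustrative, not the paper's:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_reasoning(frame_probs, threshold=0.5):
    """Gate: invoke the expensive deep-reasoning path only when the
    per-frame prediction is ambiguous (high entropy). The threshold
    is an illustrative assumption, not a value from the paper."""
    return entropy(frame_probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]  # low entropy -> skip reasoning
ambiguous = [0.4, 0.35, 0.15, 0.1]    # high entropy -> reason deeply
print(needs_reasoning(confident), needs_reasoning(ambiguous))
```

Frames that pass the gate would then go through the full reasoning trace; confident frames are answered directly, which is where the compute savings come from.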
Paper
- Introduces a unified 4D representation (static background point cloud + per‑object 3D Gaussian trajectories) that captures both camera motion and object dynamics in space‑time.
- Leverages this representation as conditioning for a pretrained video diffusion model, yielding view‑consistent, high‑fidelity videos that strictly follow specified 4D motions.
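A minimal container for this kind of 4D representation might pair a static point list with per-object trajectories of Gaussian centers; the class layout, linear interpolation, and all names here are hypothetical, shown only to make the data structure concrete:

```python
from dataclasses import dataclass, field

@dataclass
class Scene4D:
    """Sketch of the summary's unified 4D representation: a static
    background point cloud plus per-object 3D Gaussian-center
    trajectories keyed by timestep. (Names and the interpolation
    scheme are assumptions, not the paper's implementation.)"""
    background: list                                   # static (x, y, z) points
    trajectories: dict = field(default_factory=dict)   # obj_id -> {t: center}

    def object_center(self, obj_id, t):
        """Linearly interpolate an object's Gaussian center at time t."""
        traj = self.trajectories[obj_id]
        times = sorted(traj)
        t0 = max(s for s in times if s <= t)
        t1 = min(s for s in times if s >= t)
        if t0 == t1:
            return traj[t0]
        w = (t - t0) / (t1 - t0)
        return tuple(a + w * (b - a) for a, b in zip(traj[t0], traj[t1]))

scene = Scene4D(background=[(0, 0, 0), (1, 0, 0)])
scene.trajectories["car"] = {0.0: (0.0, 0.0, 0.0), 1.0: (2.0, 0.0, 0.0)}
print(scene.object_center("car", 0.5))  # interpolated mid-trajectory position
```

Sampling such a structure at each timestep yields the camera-and-object motion signal that conditions the video diffusion model.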
Paper
- A correspondence‑based data engine turns a single human demonstration into thousands of high‑quality, category‑wide synthetic training examples by morphing object meshes, transferring the expert grasp, and locally optimizing it.
- The generated dataset encodes both semantic (tool function) and geometric cues, enabling a multimodal network to predict grasps that respect the intended usage (e.g., pulling, cutting).
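The correspondence-based transfer step can be sketched as mapping demo contact vertices through a precomputed source-to-target vertex correspondence; the dict-based correspondence and all names are assumptions, and the paper's subsequent local grasp optimization is omitted:

```python
def transfer_grasp(demo_contacts, correspondence, target_vertices):
    """Map each contact vertex of the single human demonstration onto
    the morphed target mesh via a source->target vertex correspondence.
    (Illustrative sketch: the paper additionally optimizes the
    transferred grasp locally, which is not shown here.)"""
    return [target_vertices[correspondence[v]] for v in demo_contacts]

# Demo grasp touches source vertices 0 and 2; the correspondence maps
# them onto vertices of a morphed instance from the same category.
correspondence = {0: 1, 1: 0, 2: 2}
target_vertices = [(0.0, 0.0, 0.0), (0.5, 0.1, 0.0), (1.0, 0.0, 0.2)]
print(transfer_grasp([0, 2], correspondence, target_vertices))
```

Repeating this over many morphed meshes is what turns one demonstration into a category-wide synthetic dataset.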
Paper
- QNeRF replaces large MLPs in NeRF with parameterised quantum circuits, exploiting superposition and entanglement to encode spatial and view‑dependent features.
- Two variants are proposed: **Full QNeRF** uses the entire quantum state for maximal expressivity, while **Dual‑Branch QNeRF** splits spatial and view encodings, dramatically lowering circuit depth and improving scalability to near‑term hardware.
Paper
- A compact spatio‑temporal latent space encodes an entire animation sequence in one forward pass, enabling “one‑shot” reconstruction of 3D shape and motion.
- The latent space is learned with a skeleton‑guided autoencoder, providing strong deformation priors during training while requiring no skeletal input at test time.
Paper
- Introduces a hybrid pipeline that first applies a bespoke statistical gray‑pixel detector to estimate illumination in noisy, low‑light scenes.
- Develops the first deep reinforcement learning (DRL) agent that treats the statistical estimator as its environment, learning to fine‑tune AWB parameters per‑image in a manner akin to a human expert.
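The statistical gray-pixel idea in the first stage can be sketched as follows: pixels whose normalized channels are nearly equal are assumed achromatic, and their mean color estimates the illuminant. The tolerance parameter and the simple mean are illustrative assumptions; the paper's detector is more sophisticated:

```python
def estimate_illuminant(pixels, tol=0.1):
    """Hedged sketch of a gray-pixel illuminant estimator: collect
    pixels whose chromaticity is nearly neutral and average them.
    ('tol' and the averaging are assumptions, not the paper's method.)"""
    grays = []
    for r, g, b in pixels:
        s = r + g + b
        if s == 0:
            continue
        rn, gn, bn = r / s, g / s, b / s
        if max(rn, gn, bn) - min(rn, gn, bn) < tol:
            grays.append((r, g, b))
    n = len(grays)
    return tuple(sum(c[i] for c in grays) / n for i in range(3))

# A warm illuminant tints true grays toward red; saturated pixels are ignored.
pixels = [(120, 100, 95), (240, 200, 190), (255, 30, 30), (10, 200, 10)]
illum = estimate_illuminant(pixels)
gains = [max(illum) / c for c in illum]  # per-channel AWB gains
print(illum, gains)
```

In the paper's pipeline, an estimate of this kind defines the environment state from which the DRL agent then fine-tunes the white-balance parameters per image.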
Paper
- Introduces GREx, a unified benchmark that expands traditional referring expression tasks (RES, REC, REG) to support single‑target, multi‑target, and no‑target expressions, enabling more realistic and flexible language‑vision interactions.
- Releases gRefCOCO, the first large‑scale dataset containing annotated images with all three expression types, while remaining backward‑compatible with existing RES/REC datasets for fair comparison.
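What distinguishes this setting from classic REC evaluation is that predictions must be judged as *sets* of regions, including the empty set. A toy matcher makes that concrete; this is an illustrative sketch, not the benchmark's official metric:

```python
def expression_match(pred, gt, iou_thresh=0.5):
    """Toy matcher covering the three expression types in the summary:
    no-target (empty ground truth is correct only if nothing is
    predicted), and single-/multi-target (greedy IoU matching on
    (x1, y1, x2, y2) boxes). Not the benchmark's official metric."""
    if not gt:
        return not pred
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        ix = max(0, min(ax2, bx2) - max(ax1, bx1))
        iy = max(0, min(ay2, by2) - max(ay1, by1))
        inter = ix * iy
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union else 0.0
    unmatched = list(pred)
    for g in gt:
        best = max(unmatched, key=lambda p: iou(p, g), default=None)
        if best is None or iou(best, g) < iou_thresh:
            return False
        unmatched.remove(best)
    return not unmatched

print(expression_match([], []))                              # no-target case
print(expression_match([(0, 0, 10, 10)], [(1, 1, 10, 10)]))  # single target
```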
Paper
- Introduces “visual identity prompting,” which supplies diffusion models with explicit object cues, enabling generation of consistent multi‑view videos that preserve object appearance across frames.
- The generated videos serve as high‑fidelity data augmentations, enriching the visual diversity of manipulation datasets without manual collection.
Paper
- Tokens with the highest predictive entropy dominate the semantic output of V‑L models; tampering with only these few tokens yields large degradations.
- Entropy‑driven attacks achieve comparable (or greater) success with far lower perturbation budgets than naïve or gradient‑based token attacks.
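The selection step behind such an attack can be sketched as ranking token positions by predictive entropy and perturbing only the top few; this shows selection only, and all names are illustrative:

```python
import math

def top_entropy_tokens(token_probs, k=2):
    """Rank token positions by predictive entropy and return the k most
    uncertain ones -- the positions the summary claims dominate the
    model's semantic output. (Selection sketch only; the perturbation
    itself is not shown.)"""
    def H(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    scored = sorted(range(len(token_probs)),
                    key=lambda i: H(token_probs[i]), reverse=True)
    return scored[:k]

probs = [
    [0.99, 0.01],  # confident token: low entropy
    [0.5, 0.5],    # maximally uncertain token
    [0.7, 0.3],    # moderately uncertain token
]
print(top_entropy_tokens(probs, k=2))
```

Concentrating the perturbation budget on these positions is why the attack matches gradient-based baselines at far lower cost.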
Paper
- Reformulates multimodal reasoning as a native image‑to‑image generation task, enabling direct manipulation of visual information instead of indirect text prompts.
- Demonstrates four intrinsic advantages—efficiency, controllability, native parallelism, and seamless collaboration between vision and language modules—leading to more logically consistent and spatially precise outputs.