Learning Library


Computer-Vision

12 items in this topic

Paper

PlenopticDreamer: Coherent Multi‑View Video Synthesis

  • Introduces a camera‑guided retrieval module that pulls relevant latent frames from a pre‑built spatio‑temporal memory, ensuring consistent geometry across different viewpoints.
  • Employs progressive training (stage‑wise spatial then temporal finetuning) to stabilize GAN learning and significantly boost temporal coherence without sacrificing spatial detail.
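The retrieval idea can be sketched as a nearest-neighbour lookup over a pose-indexed memory. This is an illustrative toy, assuming a simple memory of (camera pose, latent frame) pairs ranked by positional distance; the names `MemoryBank`, `pose_distance`, and `retrieve` are not the paper's API.

```python
import math

def pose_distance(p, q):
    """Euclidean distance between two camera positions (x, y, z)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class MemoryBank:
    """Toy spatio-temporal memory of (camera pose, latent frame) pairs."""
    def __init__(self):
        self.entries = []  # list of (pose, latent)

    def add(self, pose, latent):
        self.entries.append((pose, latent))

    def retrieve(self, query_pose, k=2):
        """Return the k latent frames whose poses are closest to the query."""
        ranked = sorted(self.entries, key=lambda e: pose_distance(e[0], query_pose))
        return [latent for _, latent in ranked[:k]]

bank = MemoryBank()
bank.add((0.0, 0.0, 1.0), "latent_front")
bank.add((1.0, 0.0, 0.0), "latent_side")
bank.add((0.0, 1.0, 0.0), "latent_top")
nearest = bank.retrieve((0.1, 0.0, 0.9), k=1)  # closest stored view
```

A real system would rank by full pose (rotation included) and retrieve in latent space, but the control flow is the same: the query camera decides which memories condition the next frame.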
Paper

Pixel‑Perfect Diffusion Transformers for Depth Estimation

  • Introduces **Pixel‑Perfect Depth (PPD)**, a monocular depth model that operates directly in pixel space using diffusion transformers, eliminating flying pixels and preserving fine scene details.
  • **Semantics‑Prompted DiT** injects high‑level semantic embeddings from large vision foundation models into the diffusion process, guiding global structure while still allowing the model to recover sharp local geometry.
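The prompting mechanism can be illustrated as a broadcast add of a projected semantic embedding onto every diffusion token. All names, shapes, and the plain linear projection below are assumptions for illustration, not the paper's implementation.

```python
def project(embedding, weight):
    """Linear projection: weight is an (out_dim x in_dim) matrix as nested lists."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weight]

def prompt_tokens(tokens, semantic_embedding, weight):
    """Add the projected semantic embedding to each diffusion token."""
    prompt = project(semantic_embedding, weight)
    return [[t + p for t, p in zip(tok, prompt)] for tok in tokens]

tokens = [[1.0, 0.0], [0.0, 1.0]]        # two 2-d diffusion tokens
semantic = [0.5, 0.5, 0.0]               # 3-d foundation-model embedding
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # project 3-d -> 2-d
out = prompt_tokens(tokens, semantic, W)  # every token shifted by the prompt
```

The semantic signal steers global structure uniformly, while the per-token values still carry the local geometry the denoiser refines in pixel space.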
Paper

Efficient Video Reasoning with Dual-Answer Training

  • Introduces a “reason‑when‑necessary” policy that triggers deep reasoning only for ambiguous video frames, reducing unnecessary computation.
  • Proposes a “Thinking Once, Answering Twice” paradigm where the model generates an intermediate reasoning trace before producing two complementary answers, improving answer consistency.
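The gating policy reduces to a confidence test: answer cheaply when the fast prediction is sharp, and invoke deep reasoning only when it is ambiguous. A minimal sketch, assuming an entropy threshold and illustrative function names:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer(frame_probs, fast_answer, deep_reasoner, threshold=0.5):
    """Return the cheap answer when confident; otherwise trigger deep reasoning."""
    if entropy(frame_probs) <= threshold:
        return fast_answer
    return deep_reasoner()

confident = [0.95, 0.03, 0.02]   # low entropy -> skip reasoning
ambiguous = [0.4, 0.35, 0.25]    # high entropy -> reason deeply
a1 = answer(confident, "cat", lambda: "deep:cat")
a2 = answer(ambiguous, "cat", lambda: "deep:dog")
```

The "answering twice" part would then derive two complementary answers from one shared reasoning trace; the gate above is what keeps that trace from being computed on every frame.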
Paper

4D Geometric Control for Realistic Video World Modeling

  • Introduces a unified 4D representation (static background point cloud + per‑object 3D Gaussian trajectories) that captures both camera motion and object dynamics in space‑time.
  • Leverages this representation as conditioning for a pretrained video diffusion model, yielding view‑consistent, high‑fidelity videos that strictly follow specified 4D motions.
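The representation itself is simple to state in code: a static point set plus per-object trajectories indexed by time. A toy sketch, with illustrative names (`Scene4D`, `state_at`); the paper feeds such a structure to a video diffusion model as conditioning rather than rendering it directly.

```python
class Scene4D:
    """Static background point cloud + per-object positions over time."""
    def __init__(self, background_points):
        self.background = background_points   # static (x, y, z) points
        self.trajectories = {}                # object id -> {t: (x, y, z)}

    def add_trajectory(self, obj_id, keyframes):
        self.trajectories[obj_id] = dict(keyframes)

    def state_at(self, t):
        """Background plus every object's position defined at time t."""
        objects = {oid: tr[t] for oid, tr in self.trajectories.items() if t in tr}
        return {"background": self.background, "objects": objects}

scene = Scene4D(background_points=[(0, 0, 0), (1, 0, 0)])
scene.add_trajectory("car", {0: (0.0, 0.0, 0.0), 1: (0.5, 0.0, 0.0)})
frame1 = scene.state_at(1)   # conditioning signal for the frame at t = 1
```

Because camera motion and object motion live in the same space-time structure, a generated frame can be checked against a single consistent geometric source.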
Paper

One‑Shot Functional Dexterous Grasp Learning via Synthetic Transfer

  • A correspondence‑based data engine turns a single human demonstration into thousands of high‑quality, category‑wide synthetic training examples by morphing object meshes, transferring the expert grasp, and locally optimizing it.
  • The generated dataset encodes both semantic (tool function) and geometric cues, enabling a multimodal network to predict grasps that respect the intended usage (e.g., pulling, cutting).
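The transfer step of such a data engine can be caricatured as mapping a grasp annotation through a vertex correspondence onto a new instance of the category. Everything below (names, the dictionary correspondence, the omitted local optimization) is an illustrative stand-in:

```python
def transfer_grasp(grasp_vertex_idx, correspondence, target_vertices):
    """Map the demo grasp vertex through the correspondence to the new mesh."""
    target_idx = correspondence[grasp_vertex_idx]
    return target_vertices[target_idx]

demo_grasp_idx = 2                   # grasp annotated on vertex 2 of the demo mesh
correspondence = {0: 0, 1: 2, 2: 1}  # demo vertex -> target vertex
target_vertices = [(0, 0, 0), (1, 1, 0), (2, 0, 0)]
grasp_point = transfer_grasp(demo_grasp_idx, correspondence, target_vertices)
```

In the full engine the transferred grasp is then locally optimized against the morphed mesh, so one demonstration fans out into thousands of physically plausible, category-wide examples.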
Paper

Quantum‑Enhanced Neural Radiance Fields for Compact 3D Synthesis

  • QNeRF replaces large MLPs in NeRF with parameterised quantum circuits, exploiting superposition and entanglement to encode spatial and view‑dependent features.
  • Two variants are proposed: **Full QNeRF** uses the entire quantum state for maximal expressivity, while **Dual‑Branch QNeRF** splits spatial and view encodings, dramatically lowering circuit depth and improving scalability to near‑term hardware.
Paper

Single‑Shot 4D Mesh Reconstruction from Monocular Video

  • A compact spatio‑temporal latent space encodes an entire animation sequence in one forward pass, enabling “one‑shot” reconstruction of 3D shape and motion.
  • The latent space is learned with a skeleton‑guided autoencoder, providing strong deformation priors during training while requiring no skeletal input at test time.
Paper

RL‑AWB: Reinforcement Learning for Nighttime White Balance

  • Introduces a hybrid pipeline that first applies a bespoke statistical gray‑pixel detector to estimate illumination in noisy, low‑light scenes.
  • Develops the first deep reinforcement learning (DRL) agent that treats the statistical estimator as its environment, learning to fine‑tune AWB parameters per‑image in a manner akin to a human expert.
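The statistical first stage can be sketched with the classic gray-pixel idea: pixels whose channels are nearly equal are assumed gray, their mean colour estimates the illuminant, and per-channel gains neutralize it. The tolerance value and function names are assumptions, and the paper's detector is tailored to low-light noise in ways this sketch omits.

```python
def estimate_illuminant(pixels, tolerance=0.1):
    """Average the near-gray pixels; each pixel is (r, g, b) in [0, 1]."""
    grays = [p for p in pixels if max(p) - min(p) <= tolerance]
    n = len(grays)
    return tuple(sum(c[i] for c in grays) / n for i in range(3))

def white_balance_gains(illuminant):
    """Per-channel gains that map the estimated illuminant to neutral gray."""
    g = illuminant[1]
    return tuple(g / c for c in illuminant)

pixels = [(0.5, 0.5, 0.4),   # near-gray under a warm cast
          (0.9, 0.2, 0.1),   # saturated red: rejected as non-gray
          (0.5, 0.5, 0.4)]
illum = estimate_illuminant(pixels)    # warm illuminant estimate
gains = white_balance_gains(illum)     # boosts blue to neutralize the cast
```

The DRL agent then treats an estimator like this as its environment, nudging its parameters (e.g. the tolerance) per image instead of trusting one fixed setting.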
Paper

Generalized Referring Expressions for Multi‑Target Vision‑Language Tasks

  • Introduces GREx, a unified benchmark that expands traditional referring expression tasks (RES, REC, REG) to support single‑target, multi‑target, and no‑target expressions, enabling more realistic and flexible language‑vision interactions.
  • Releases gRefCOCO, the first large‑scale dataset containing annotated images with all three expression types, while remaining backward‑compatible with existing RES/REC datasets for fair comparison.
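What changes in evaluation is that an expression may ground to zero, one, or many objects, so the empty set becomes a first-class answer. A toy exact-match checker, with an illustrative function name and id scheme:

```python
def expression_correct(predicted_ids, annotated_ids):
    """Exact-set match over target object ids; set() encodes no-target."""
    return set(predicted_ids) == set(annotated_ids)

single = expression_correct([4], [4])        # classic single-target case
multi = expression_correct([1, 3], [3, 1])   # multi-target, order-free
no_target = expression_correct([], [])       # expression refers to nothing
missed = expression_correct([3], [1, 3])     # partial match counts as wrong
```

Models tuned for the classic single-target setting fail the `no_target` and `multi` cases by construction, which is the gap the benchmark is designed to expose.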
Paper

Visual Identity Prompted Multi‑View Video Augmentation for Robotics

  • Introduces “visual identity prompting,” which supplies diffusion models with explicit object cues, enabling generation of consistent multi‑view videos that preserve object appearance across frames.
  • The generated videos serve as high‑fidelity data augmentations, enriching the visual diversity of manipulation datasets without manual collection.
Paper

Entropy‑Guided Token Attacks on Vision‑Language Models

  • Tokens with the highest predictive entropy dominate the semantic output of vision‑language models; tampering with only these few tokens yields large degradations.
  • Entropy‑driven attacks achieve comparable (or greater) success with far lower perturbation budgets than naïve or gradient‑based token attacks.
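The selection step can be sketched as ranking tokens by predictive entropy and spending the perturbation budget only on the top‑k. Names and the toy distributions below are illustrative; the actual attack perturbs the chosen tokens' inputs, which this sketch omits.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_attack_targets(token_probs, k=1):
    """Indices of the k tokens with the highest predictive entropy."""
    scored = sorted(range(len(token_probs)),
                    key=lambda i: token_entropy(token_probs[i]),
                    reverse=True)
    return scored[:k]

token_probs = [
    [0.98, 0.01, 0.01],   # confident token: low entropy
    [0.34, 0.33, 0.33],   # ambiguous token: near-maximal entropy
    [0.90, 0.05, 0.05],
]
targets = select_attack_targets(token_probs, k=1)   # attack the ambiguous token
```

Concentrating on high-entropy tokens is what lets the attack match gradient-based methods with a far smaller perturbation budget.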
Paper

DiffThinker: Diffusion‑Based Generative Multimodal Reasoning

  • Reformulates multimodal reasoning as a native image‑to‑image generation task, enabling direct manipulation of visual information instead of indirect text prompts.
  • Demonstrates four intrinsic advantages—efficiency, controllability, native parallelism, and seamless collaboration between vision and language modules—leading to more logically consistent and spatially precise outputs.