Learning Library


Computer-Vision

12 items in this topic

Paper

PlenopticDreamer: Coherent Multi‑View Video Synthesis

  • Introduces a camera‑guided retrieval module that pulls relevant latent frames from a pre‑built spatio‑temporal memory, ensuring consistent geometry across different viewpoints.
  • Employs progressive training (stage‑wise spatial then temporal finetuning) to stabilize GAN learning and significantly boost temporal coherence without sacrificing spatial detail.
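The retrieval idea can be sketched as a nearest-neighbour lookup over a pose-indexed memory. This is an illustrative toy, assuming a simple memory of (camera pose, latent frame) pairs ranked by positional distance; the names `MemoryBank`, `pose_distance`, and `retrieve` are not the paper's API.

```python
import math

def pose_distance(p, q):
    """Euclidean distance between two camera positions (x, y, z)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class MemoryBank:
    """Toy spatio-temporal memory of (camera pose, latent frame) pairs."""
    def __init__(self):
        self.entries = []  # list of (pose, latent)

    def add(self, pose, latent):
        self.entries.append((pose, latent))

    def retrieve(self, query_pose, k=2):
        """Return the k latent frames whose poses are closest to the query."""
        ranked = sorted(self.entries, key=lambda e: pose_distance(e[0], query_pose))
        return [latent for _, latent in ranked[:k]]

bank = MemoryBank()
bank.add((0.0, 0.0, 1.0), "latent_front")
bank.add((1.0, 0.0, 0.0), "latent_side")
bank.add((0.0, 1.0, 0.0), "latent_top")
nearest = bank.retrieve((0.1, 0.0, 0.9), k=1)  # closest stored view
```

A real system would rank by full pose (rotation included) and retrieve in latent space, but the control flow is the same: the query camera decides which memories condition the next frame.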
Paper

Pixel‑Perfect Diffusion Transformers for Depth Estimation

  • Introduces **Pixel‑Perfect Depth (PPD)**, a monocular depth model that operates directly in pixel space using diffusion transformers, eliminating flying pixels and preserving fine scene details.
  • **Semantics‑Prompted DiT** injects high‑level semantic embeddings from large vision foundation models into the diffusion process, guiding global structure while still allowing the model to recover sharp local geometry.
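The prompting mechanism can be illustrated as a broadcast add of a projected semantic embedding onto every diffusion token. All names, shapes, and the plain linear projection below are assumptions for illustration, not the paper's implementation.

```python
def project(embedding, weight):
    """Linear projection: weight is an (out_dim x in_dim) matrix as nested lists."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weight]

def prompt_tokens(tokens, semantic_embedding, weight):
    """Add the projected semantic embedding to each diffusion token."""
    prompt = project(semantic_embedding, weight)
    return [[t + p for t, p in zip(tok, prompt)] for tok in tokens]

tokens = [[1.0, 0.0], [0.0, 1.0]]        # two 2-d diffusion tokens
semantic = [0.5, 0.5, 0.0]               # 3-d foundation-model embedding
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # project 3-d -> 2-d
out = prompt_tokens(tokens, semantic, W)  # every token shifted by the prompt
```

The semantic signal steers global structure uniformly, while the per-token values still carry the local geometry the denoiser refines in pixel space.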
Paper

Efficient Video Reasoning with Dual-Answer Training

  • Introduces a “reason‑when‑necessary” policy that triggers deep reasoning only for ambiguous video frames, reducing unnecessary computation.
  • Proposes a “Thinking Once, Answering Twice” paradigm where the model generates an intermediate reasoning trace before producing two complementary answers, improving answer consistency.
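The gating policy reduces to a confidence test: answer cheaply when the fast prediction is sharp, and invoke deep reasoning only when it is ambiguous. A minimal sketch, assuming an entropy threshold and illustrative function names:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer(frame_probs, fast_answer, deep_reasoner, threshold=0.5):
    """Return the cheap answer when confident; otherwise trigger deep reasoning."""
    if entropy(frame_probs) <= threshold:
        return fast_answer
    return deep_reasoner()

confident = [0.95, 0.03, 0.02]   # low entropy -> skip reasoning
ambiguous = [0.4, 0.35, 0.25]    # high entropy -> reason deeply
a1 = answer(confident, "cat", lambda: "deep:cat")
a2 = answer(ambiguous, "cat", lambda: "deep:dog")
```

The "answering twice" part would then derive two complementary answers from one shared reasoning trace; the gate above is what keeps that trace from being computed on every frame.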
Paper

4D Geometric Control for Realistic Video World Modeling

  • Introduces a unified 4D representation (static background point cloud + per‑object 3D Gaussian trajectories) that captures both camera motion and object dynamics in space‑time.
  • Leverages this representation as conditioning for a pretrained video diffusion model, yielding view‑consistent, high‑fidelity videos that strictly follow specified 4D motions.
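The representation itself is simple to state in code: a static point set plus per-object trajectories indexed by time. A toy sketch, with illustrative names (`Scene4D`, `state_at`); the paper feeds such a structure to a video diffusion model as conditioning rather than rendering it directly.

```python
class Scene4D:
    """Static background point cloud + per-object positions over time."""
    def __init__(self, background_points):
        self.background = background_points   # static (x, y, z) points
        self.trajectories = {}                # object id -> {t: (x, y, z)}

    def add_trajectory(self, obj_id, keyframes):
        self.trajectories[obj_id] = dict(keyframes)

    def state_at(self, t):
        """Background plus every object's position defined at time t."""
        objects = {oid: tr[t] for oid, tr in self.trajectories.items() if t in tr}
        return {"background": self.background, "objects": objects}

scene = Scene4D(background_points=[(0, 0, 0), (1, 0, 0)])
scene.add_trajectory("car", {0: (0.0, 0.0, 0.0), 1: (0.5, 0.0, 0.0)})
frame1 = scene.state_at(1)   # conditioning signal for the frame at t = 1
```

Because camera motion and object motion live in the same space-time structure, a generated frame can be checked against a single consistent geometric source.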
Paper

One‑Shot Functional Dexterous Grasp Learning via Synthetic Transfer

  • A correspondence‑based data engine turns a single human demonstration into thousands of high‑quality, category‑wide synthetic training examples by morphing object meshes, transferring the expert grasp, and locally optimizing it.
  • The generated dataset encodes both semantic (tool function) and geometric cues, enabling a multimodal network to predict grasps that respect the intended usage (e.g., pulling, cutting).
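The transfer step of such a data engine can be caricatured as mapping a grasp annotation through a vertex correspondence onto a new instance of the category. Everything below (names, the dictionary correspondence, the omitted local optimization) is an illustrative stand-in:

```python
def transfer_grasp(grasp_vertex_idx, correspondence, target_vertices):
    """Map the demo grasp vertex through the correspondence to the new mesh."""
    target_idx = correspondence[grasp_vertex_idx]
    return target_vertices[target_idx]

demo_grasp_idx = 2                   # grasp annotated on vertex 2 of the demo mesh
correspondence = {0: 0, 1: 2, 2: 1}  # demo vertex -> target vertex
target_vertices = [(0, 0, 0), (1, 1, 0), (2, 0, 0)]
grasp_point = transfer_grasp(demo_grasp_idx, correspondence, target_vertices)
```

In the full engine the transferred grasp is then locally optimized against the morphed mesh, so one demonstration fans out into thousands of physically plausible, category-wide examples.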
Paper

Quantum‑Enhanced Neural Radiance Fields for Compact 3D Synthesis

  • QNeRF replaces large MLPs in NeRF with parameterised quantum circuits, exploiting superposition and entanglement to encode spatial and view‑dependent features.
  • Two variants are proposed: **Full QNeRF** uses the entire quantum state for maximal expressivity, while **Dual‑Branch QNeRF** splits spatial and view encodings, dramatically lowering circuit depth and improving scalability to near‑term hardware.
Paper

Single‑Shot 4D Mesh Reconstruction from Monocular Video

  • A compact spatio‑temporal latent space encodes an entire animation sequence in one forward pass, enabling “one‑shot” reconstruction of 3D shape and motion.
  • The latent space is learned with a skeleton‑guided autoencoder, providing strong deformation priors during training while requiring no skeletal input at test time.
Paper

RL‑AWB: Reinforcement Learning for Nighttime White Balance

  • Introduces a hybrid pipeline that first applies a bespoke statistical gray‑pixel detector to estimate illumination in noisy, low‑light scenes.
  • Develops the first deep reinforcement learning (DRL) agent that treats the statistical estimator as its environment, learning to fine‑tune AWB parameters per‑image in a manner akin to a human expert.
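The statistical first stage can be sketched with the classic gray-pixel idea: pixels whose channels are nearly equal are assumed gray, their mean colour estimates the illuminant, and per-channel gains neutralize it. The tolerance value and function names are assumptions, and the paper's detector is tailored to low-light noise in ways this sketch omits.

```python
def estimate_illuminant(pixels, tolerance=0.1):
    """Average the near-gray pixels; each pixel is (r, g, b) in [0, 1]."""
    grays = [p for p in pixels if max(p) - min(p) <= tolerance]
    n = len(grays)
    return tuple(sum(c[i] for c in grays) / n for i in range(3))

def white_balance_gains(illuminant):
    """Per-channel gains that map the estimated illuminant to neutral gray."""
    g = illuminant[1]
    return tuple(g / c for c in illuminant)

pixels = [(0.5, 0.5, 0.4),   # near-gray under a warm cast
          (0.9, 0.2, 0.1),   # saturated red: rejected as non-gray
          (0.5, 0.5, 0.4)]
illum = estimate_illuminant(pixels)    # warm illuminant estimate
gains = white_balance_gains(illum)     # boosts blue to neutralize the cast
```

The DRL agent then treats an estimator like this as its environment, nudging its parameters (e.g. the tolerance) per image instead of trusting one fixed setting.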
Paper

Generalized Referring Expressions for Multi‑Target Vision‑Language Tasks

  • Introduces GREx, a unified benchmark that expands traditional referring expression tasks (RES, REC, REG) to support single‑target, multi‑target, and no‑target expressions, enabling more realistic and flexible language‑vision interactions.
  • Releases gRefCOCO, the first large‑scale dataset containing annotated images with all three expression types, while remaining backward‑compatible with existing RES/REC datasets for fair comparison.
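What changes in evaluation is that an expression may ground to zero, one, or many objects, so the empty set becomes a first-class answer. A toy exact-match checker, with an illustrative function name and id scheme:

```python
def expression_correct(predicted_ids, annotated_ids):
    """Exact-set match over target object ids; set() encodes no-target."""
    return set(predicted_ids) == set(annotated_ids)

single = expression_correct([4], [4])        # classic single-target case
multi = expression_correct([1, 3], [3, 1])   # multi-target, order-free
no_target = expression_correct([], [])       # expression refers to nothing
missed = expression_correct([3], [1, 3])     # partial match counts as wrong
```

Models tuned for the classic single-target setting fail the `no_target` and `multi` cases by construction, which is the gap the benchmark is designed to expose.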
Paper

Visual Identity Prompted Multi‑View Video Augmentation for Robotics

  • Introduces “visual identity prompting,” which supplies diffusion models with explicit object cues, enabling generation of consistent multi‑view videos that preserve object appearance across frames.
  • The generated videos serve as high‑fidelity data augmentations, enriching the visual diversity of manipulation datasets without manual collection.
Paper

Entropy‑Guided Token Attacks on Vision‑Language Models

  • Tokens with the highest predictive entropy dominate the semantic output of vision‑language models; tampering with only these few tokens yields large degradations.
  • Entropy‑driven attacks achieve comparable (or greater) success with far lower perturbation budgets than naïve or gradient‑based token attacks.
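The selection step can be sketched as ranking tokens by predictive entropy and spending the perturbation budget only on the top‑k. Names and the toy distributions below are illustrative; the actual attack perturbs the chosen tokens' inputs, which this sketch omits.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_attack_targets(token_probs, k=1):
    """Indices of the k tokens with the highest predictive entropy."""
    scored = sorted(range(len(token_probs)),
                    key=lambda i: token_entropy(token_probs[i]),
                    reverse=True)
    return scored[:k]

token_probs = [
    [0.98, 0.01, 0.01],   # confident token: low entropy
    [0.34, 0.33, 0.33],   # ambiguous token: near-maximal entropy
    [0.90, 0.05, 0.05],
]
targets = select_attack_targets(token_probs, k=1)   # attack the ambiguous token
```

Concentrating on high-entropy tokens is what lets the attack match gradient-based methods with a far smaller perturbation budget.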
Paper

DiffThinker: Diffusion‑Based Generative Multimodal Reasoning

  • Reformulates multimodal reasoning as a native image‑to‑image generation task, enabling direct manipulation of visual information instead of indirect text prompts.
  • Demonstrates four intrinsic advantages—efficiency, controllability, native parallelism, and seamless collaboration between vision and language modules—leading to more logically consistent and spatially precise outputs.