Paper ▲ 10 • research-paper • advanced
- Introduces a “reason‑when‑necessary” policy that triggers deep reasoning only for ambiguous video frames, reducing unnecessary computation.
- Proposes a “Thinking Once, Answering Twice” paradigm where the model generates an intermediate reasoning trace before producing two complementary answers, improving answer consistency.
Paper ▲ 74 • research-paper • advanced
- Directly applying GRPO’s group‑wise normalization to a mixture of rewards collapses distinct advantage signals into near‑identical values, hurting learning dynamics.
- GDPO separates (decouples) the normalization step for each reward component, preserving their relative magnitudes before a final batch‑wise advantage scaling.
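The contrast between the two normalization schemes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the group of four responses and the two reward components (a sparse correctness score and a dense format score) are hypothetical, and the final batch-wise advantage scaling GDPO applies afterward is omitted.

```python
import numpy as np

# Hypothetical group of G=4 sampled responses, each scored by two
# reward components with very different distributions.
correctness = np.array([1.0, 0.0, 1.0, 0.0])  # sparse 0/1 reward
fmt = np.array([0.9, 0.8, 0.7, 1.0])          # dense shaped reward

def normalize(x, eps=1e-8):
    """Group-wise standardization used for the advantage estimate."""
    return (x - x.mean()) / (x.std() + eps)

# GRPO-style: sum the rewards first, then normalize the mixture once.
# The distinct signals are entangled before normalization.
grpo_adv = normalize(correctness + fmt)

# GDPO-style (per the summary above): normalize each component
# separately, then combine, so each reward keeps its own relative scale.
gdpo_adv = normalize(correctness) + normalize(fmt)
```

On this toy group the two schemes produce visibly different advantage vectors: under the mixed scheme the dense format reward shifts every response's advantage, while the decoupled scheme keeps the sparse correctness signal at full strength alongside it.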
Paper ▲ 19 • research-paper • advanced
- Introduces a hybrid pipeline that first applies a bespoke statistical gray‑pixel detector to estimate illumination in noisy, low‑light scenes.
- Develops the first deep reinforcement learning (DRL) agent that treats the statistical estimator as its environment, learning to fine‑tune AWB parameters per‑image in a manner akin to a human expert.
Paper ▲ 19 • research-paper • advanced
- “Visual identity prompting” supplies diffusion models with explicit object cues, enabling generation of consistent multi‑view videos that preserve object appearance across frames.
- The generated videos serve as high‑fidelity data augmentations, enriching the visual diversity of manipulation datasets without manual collection.
Paper ▲ 15 • research-paper • advanced
- Turn‑level tree search injects diverse, forward‑looking trajectories, dramatically improving exploration in multi‑turn environments.
- By formulating separate learning objectives for each turn, AT²PO provides clearer credit assignment across long horizons.