Paper ▲ 10 • research-paper • advanced
- Introduces a “reason‑when‑necessary” policy that triggers deep reasoning only for ambiguous video frames, reducing unnecessary computation.
- Proposes a “Thinking Once, Answering Twice” paradigm where the model generates an intermediate reasoning trace before producing two complementary answers, improving answer consistency.
Paper ▲ 74 • research-paper • advanced
- Directly applying GRPO’s group‑wise normalization to a mixture of rewards collapses distinct advantage signals into near‑identical values, hurting learning dynamics.
- GDPO separates (decouples) the normalization step for each reward component, preserving their relative magnitudes before a final batch‑wise advantage scaling.
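The contrast between the two normalization schemes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the group of four responses and the two reward components (a sparse correctness score and a dense format score) are hypothetical, and the final batch-wise advantage scaling GDPO applies afterward is omitted.

```python
import numpy as np

# Hypothetical group of G=4 sampled responses, each scored by two
# reward components with very different distributions.
correctness = np.array([1.0, 0.0, 1.0, 0.0])  # sparse 0/1 reward
fmt = np.array([0.9, 0.8, 0.7, 1.0])          # dense shaped reward

def normalize(x, eps=1e-8):
    """Group-wise standardization used for the advantage estimate."""
    return (x - x.mean()) / (x.std() + eps)

# GRPO-style: sum the rewards first, then normalize the mixture once.
# The distinct signals are entangled before normalization.
grpo_adv = normalize(correctness + fmt)

# GDPO-style (per the summary above): normalize each component
# separately, then combine, so each reward keeps its own relative scale.
gdpo_adv = normalize(correctness) + normalize(fmt)
```

On this toy group the two schemes produce visibly different advantage vectors: under the mixed scheme the dense format reward shifts every response's advantage, while the decoupled scheme keeps the sparse correctness signal at full strength alongside it.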
Paper ▲ 19 • research-paper • advanced
- Introduces a hybrid pipeline that first applies a bespoke statistical gray‑pixel detector to estimate illumination in noisy, low‑light scenes.
- Develops the first deep reinforcement learning (DRL) agent that treats the statistical estimator as its environment, learning to fine‑tune AWB parameters per‑image in a manner akin to a human expert.
Paper ▲ 19 • research-paper • advanced
- “Visual identity prompting” supplies diffusion models with explicit object cues, enabling generation of consistent multi‑view videos that preserve object appearance across frames.
- The generated videos serve as high‑fidelity data augmentations, enriching the visual diversity of manipulation datasets without manual collection.
Paper ▲ 15 • research-paper • advanced
- Turn‑level tree search injects diverse, forward‑looking trajectories, dramatically improving exploration in multi‑turn environments.
- By formulating separate learning objectives for each turn, AT²PO provides clearer credit assignment across long horizons.