Full Analysis
# One‑Shot Functional Dexterous Grasp Learning via Synthetic Transfer
**Authors:** Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang...
**Source:** [arXiv](https://arxiv.org/abs/2601.05243)
**Published:** 2026-01-08
## Summary
- A correspondence‑based data engine turns a single human demonstration into thousands of high‑quality, category‑wide synthetic training examples by morphing object meshes, transferring the expert grasp, and locally optimizing it.
- The generated dataset encodes both semantic (tool function) and geometric cues, enabling a multimodal network to predict grasps that respect the intended usage (e.g., pulling, cutting).
- A local‑global fusion module merges image features with point‑cloud geometry, while importance‑aware sampling focuses computation on hand‑contact regions, yielding fast inference without sacrificing accuracy.
- Experiments show strong zero‑shot performance on unseen objects, with CorDex consistently outperforming prior dexterous‑grasping baselines, including in evaluations on real‑world robot hands.
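To make the importance‑aware sampling idea concrete, here is a minimal sketch of one plausible realization: down‑sampling a point cloud with selection probabilities biased toward a presumed hand‑contact region. The function name, the exponential weighting, and the `contact_center`/`sharpness` parameters are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def importance_aware_sample(points, contact_center, n_samples,
                            sharpness=5.0, rng=None):
    """Keep n_samples points, preferring those near the contact region.

    points:         (N, 3) point-cloud coordinates.
    contact_center: (3,) rough center of the hand-contact region.
    sharpness:      larger values concentrate sampling more tightly.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = np.linalg.norm(points - contact_center, axis=1)
    # Closer points get exponentially larger weight; normalize to a distribution.
    w = np.exp(-sharpness * d / (d.max() + 1e-8))
    p = w / w.sum()
    idx = rng.choice(len(points), size=n_samples, replace=False, p=p)
    return points[idx]

# Toy usage: from 2,000 random points, keep 256 biased toward the origin.
pts = np.random.default_rng(0).uniform(-1, 1, size=(2000, 3))
kept = importance_aware_sample(pts, contact_center=np.zeros(3),
                               n_samples=256, rng=np.random.default_rng(1))
```

The kept subset ends up noticeably closer to the contact center on average, which is the intended effect: downstream computation is spent on the region that matters for grasp prediction.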
## Abstract
Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
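The abstract's grasp‑transfer step (mapping the expert's contact points from the demonstrated object onto a generated instance via correspondences) can be sketched with a deliberately simplified stand‑in: nearest‑neighbor matching between point clouds. The paper's data engine uses learned correspondence estimation plus optimization; everything below (function name, nearest‑neighbor matching) is an illustrative assumption.

```python
import numpy as np

def transfer_contacts(src_points, dst_points, contact_idx):
    """Transfer demonstrated contact points from a source object to a target.

    src_points:  (N, 3) point cloud of the demonstrated object.
    dst_points:  (M, 3) point cloud of a generated object instance.
    contact_idx: indices into src_points where the expert grasp makes contact.

    Nearest-neighbor matching is used here as a crude proxy for the
    dense correspondence estimation described in the paper.
    """
    contacts = src_points[contact_idx]                       # (K, 3)
    # Pairwise distances (K, M) between contact points and target points.
    d = np.linalg.norm(contacts[:, None, :] - dst_points[None, :, :], axis=-1)
    return dst_points[d.argmin(axis=1)]                      # (K, 3)

# Sanity check: on an identical target cloud, contacts map onto themselves.
src = np.random.default_rng(0).uniform(-1, 1, size=(500, 3))
dst = src.copy()
moved = transfer_contacts(src, dst, np.array([0, 10, 20]))
```

In the actual pipeline the transferred grasp would then be locally optimized (e.g., for contact and penetration constraints) to adapt to the morphed geometry, as the abstract describes.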
---
*Topics: robotics, computer-vision, multimodal*
*Difficulty: advanced*
*Upvotes: 0*