Cross-Embodiment Dexterous Grasping
- Cross-embodiment dexterous grasping seeks unified frameworks that allow morphologically diverse robotic hands to learn and execute robust grasping strategies.
- It employs unified representations, contact-centric approaches, and graph-based encoders to abstract over kinematics and facilitate seamless simulation-to-hardware transfer.
- Empirical evaluations show high success rates, rapid inference, and strong zero-shot generalization across novel hand designs and complex manipulation tasks.
Cross-embodiment dexterous grasping refers to the problem of learning policies, representations, and algorithms that enable multiple, morphologically diverse multi-fingered robotic hands to successfully grasp and manipulate objects, ideally using a single unified model or algorithmic pipeline. Unlike traditional approaches, which tightly couple perception, control, and kinematics to specific hand geometries or joint structures, cross-embodiment grasping seeks architectures, abstractions, and learning strategies that are robust to changes of morphology, scale, and kinematic complexity. This is motivated by the need for scalable robotic manipulation across heterogeneous platforms, rapid prototyping of new hand designs, and seamless transfer from simulation to novel hardware, all without retraining or manual retargeting.
1. Unified Representations and Action Spaces
Unified internal representations of action, observation, and control are foundational to cross-embodiment grasping. Several approaches have been proposed:
- Eigengrasp Subspaces: Many works define a low-dimensional action space derived from principal component analysis (PCA) on human hand or robot articulation datasets. Actions are specified in terms of “eigengrasp amplitudes” in the reduced subspace, which are then mapped to full robot joint configurations via learned or optimized retargeting networks. Notably, CrossDex constructs an eigengrasp basis from MANO hand poses and maps policies’ action outputs through a per-hand MLP approximation of DexPilot-style retargeting (Yuan et al., 2024).
- Contact-Centric Representations: AnyDexGrasp introduces a universal, hand-agnostic contact-centric grasp representation (CGR). Rather than predicting hand configurations directly, grasps are represented and learned in terms of local contact geometry—positions, normals, distances relative to object patches—with hand-specific postprocessing mapping contact patterns to executable grasp types (Fang et al., 23 Feb 2025). CEDex and similar models generate human-like contact distributions for each object and map these onto robotic hands with diverse topology via topological merging and signed distance field (SDF) optimization (Wu et al., 29 Sep 2025).
- Morphology-Aware and Canonical Pose Spaces: Recent works such as UniMorphGrasp remap all hand poses into a canonical user-defined high-DoF space (e.g., ShadowHand), zero-padding missing joints and encoding hand structure as a graph, enabling a diffusion model to generate grasps in a morphology-invariant latent space (Wu et al., 31 Jan 2026). Morphological information is encoded as joint or link graphs, node features (axis, limits, spatial origin), and integrated via graph-attention or cross-attention modules as in GeoMatch++ and T(R,O) Grasp (Wei et al., 2024, Fei et al., 14 Oct 2025).
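As a concrete illustration of the eigengrasp action spaces above, the following sketch maps low-dimensional amplitudes to full joint commands. The basis, mean pose, and joint limits are toy values for a hypothetical 4-DoF hand, not taken from any cited system; real pipelines derive the basis via PCA and replace the clamp with a learned or optimized retargeting step.

```python
# Hypothetical eigengrasp basis for a toy 4-DoF hand: a mean pose plus two
# principal components (in practice obtained via PCA on hand pose datasets).
MEAN_POSE = [0.1, 0.2, 0.1, 0.3]            # radians
EIGENGRASPS = [
    [0.5, 0.5, 0.5, 0.5],                   # synergy 1: uniform flexion
    [0.7, -0.7, 0.1, -0.1],                 # synergy 2: finger opposition
]
JOINT_LIMITS = [(-0.2, 1.6)] * 4            # (lower, upper) per joint

def amplitudes_to_joints(amplitudes):
    """Map low-dimensional eigengrasp amplitudes to full joint angles."""
    q = list(MEAN_POSE)
    for a, basis in zip(amplitudes, EIGENGRASPS):
        for j in range(len(q)):
            q[j] += a * basis[j]
    # Clamp to joint limits, standing in for a retargeting network or IK step.
    return [min(max(qj, lo), hi) for qj, (lo, hi) in zip(q, JOINT_LIMITS)]

q = amplitudes_to_joints([1.0, 0.5])        # policy acts in 2-D, hand moves in 4-D
```

The policy thus explores a 2-dimensional action space regardless of how many joints the target hand exposes, which is the key to sharing one policy across embodiments.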
2. Policy Architectures and Retargeting Mechanisms
Policy learning for cross-embodiment grasping requires architectures that both abstract over kinematic diversity and efficiently recover hand-specific joint commands.
- Graph-Based Encoders/Architectures: Morphologies are encoded as node-edge graphs parsed from URDFs, with features such as link dimensions, joint axis, and center of mass. GeoMatch++ uses GCNs to embed the hand morphology graph and cross-attends between object geometry and morphology, enabling autoregressive contact prediction. Graph diffusion models in T(R,O) Grasp operate on bipartite graphs encoding object patches, link features, and spatial transforms, with denoising diffusion and Lie-algebra-based link pose decoding (Fei et al., 14 Oct 2025, Wei et al., 2024).
- Morphology-Conditioned Control: Policies, often PPO-based, are conditioned on a global graph embedding of the morphology (via GNNs), as in cross-embodied co-design pipelines (Fay et al., 3 Dec 2025). Action spaces may be masked dynamically to account for differing numbers of actuators.
- Retargeting via Learned or Optimization Modules: For eigengrasp-based pipelines, mapping reduced-coefficient actions to full joint space is achieved either through per-hand MLP mappings (learned offline for each hand) or an explicit optimization (e.g., minimizing SDF, contact distances, or task loss) at runtime. Human-image-to-robot-action transfer strategies use optimization to align keypoints (e.g., fingertips) between a human demonstration and the robot morphology, possibly followed by IK for joint recovery (Wei et al., 27 Oct 2025).
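The keypoint-alignment idea above can be illustrated on a toy planar two-link finger, where a single demonstrated fingertip keypoint is recovered in closed form. Link lengths are illustrative assumptions; real pipelines optimize over many keypoints and higher-DoF chains, but this shows the keypoint-to-joint recovery step at its simplest.

```python
import math

# Toy planar two-link finger (illustrative link lengths, in meters).
L1, L2 = 0.05, 0.04

def fingertip(q):
    """Forward kinematics: fingertip position for joint angles (q1, q2)."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return (x, y)

def ik_two_link(target):
    """Recover joint angles placing the fingertip on a demonstrated keypoint."""
    x, y = target
    d2 = x * x + y * y
    c2 = (d2 - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    c2 = max(-1.0, min(1.0, c2))          # clamp handles unreachable keypoints
    q2 = math.acos(c2)                     # elbow-down solution
    q1 = math.atan2(y, x) - math.atan2(L2 * math.sin(q2), L1 + L2 * math.cos(q2))
    return (q1, q2)

q = ik_two_link((0.06, 0.05))              # joint angles reaching the keypoint
```

With multiple fingertips and redundant kinematics, the closed-form step is replaced by the optimization (e.g., over SDF or contact distances) described above; the interface, keypoints in, joint angles out, is unchanged.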
3. Learning Paradigms, Losses, and Generalization
Cross-embodiment grasping pipelines typically employ one or more of: reinforcement learning (RL), supervised learning on large-scale synthetic or imitation datasets, or diffusion-based generative modeling.
- RL Formulations: Policies are trained via PPO (or similar methods), often with parallel environment rollouts, on reward functions that combine contact accuracy, object displacement, table collision penalties, and grasp stability, possibly including sparse binary terms for full success (Yuan et al., 2024, Yuan et al., 26 Sep 2025, Fay et al., 3 Dec 2025). DAgger is used for vision-based policy distillation in CrossDex.
- Diffusion and Generative Modeling: Recent methods including UniMorphGrasp and T(R,O) Grasp apply denoising diffusion probabilistic models (DDPMs) to grasp representation spaces, learning conditional generation of stable and diverse grasps given both hand graph encoding and object geometry. Hierarchical morphology-aware losses, which weight joint-level supervision by position in the hand kinematic tree, improve zero-shot and cross-morphology accuracy (Wu et al., 31 Jan 2026, Fei et al., 14 Oct 2025).
- Physics-Aware and Kinematically-Weighted Losses: Several methods impose additional losses tailored to grasp stability and physical plausibility. These include surface-pulling, external-penetration, and self-penetration repulsion forces, as well as grasp diversity (variance in joint space) and explicit kinematic-aware articulation losses (e.g., weighting joints by their influence on fingertip position) (Wu et al., 29 Sep 2025, Zhang et al., 7 Oct 2025).
- Imitation Learning and Demonstration-Driven Generalization: Systems such as ACE-F acquire robustly general policies by collecting demonstration data (teleoperation with force feedback) across diverse robot hands and tasks, and then distilling these into behaviorally rich training trajectories (Yan et al., 25 Nov 2025). Foundation-model approaches generate task-specific grasp images or other intermediate representations which are retargeted to robot hands via differentiable pipelines (Wei et al., 27 Oct 2025).
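A minimal sketch of the shaped RL reward described above, combining contact accuracy, object displacement, collision penalties, and a sparse success bonus. All weights and term names are illustrative assumptions, not values from the cited works.

```python
def grasp_reward(contact_dist, object_lift, table_penetration, success,
                 w_contact=1.0, w_lift=2.0, w_collide=5.0, bonus=10.0):
    """Shaped reward with dense terms plus a sparse binary success term."""
    r = -w_contact * contact_dist          # pull fingertips toward contacts
    r += w_lift * max(0.0, object_lift)    # reward raising the object
    r -= w_collide * table_penetration     # penalize table collisions
    if success:
        r += bonus                         # sparse term for a full grasp
    return r
```

Dense terms keep early exploration informative across very different hands, while the sparse bonus anchors all embodiments to the same task definition.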
4. Experimental Evaluation and Empirical Findings
Empirical studies demonstrate strong gains from unified morphology- and contact-aware architectures:
- Success Rates and Diversity: UniMorphGrasp achieves 94.0% average success in-domain and 91.3% zero-shot on novel morphologies (Wu et al., 31 Jan 2026). T(R,O) Grasp yields ~94.8% average success over multiple hands, with 0.21 s inference and high throughput (Fei et al., 14 Oct 2025). CrossDex's vision-based policy achieves 80% success across four hands and 35.2–39.1% zero-shot success on two previously unseen morphologies (Yuan et al., 2024).
- Few-Shot and Zero-Shot Generalization: Most state-of-the-art models enable direct transfer to new hands with previously unseen kinematic structures. For example, Cross-Embodiment Dexterous Hand Articulation Generation attains 85.6% after few-shot adaptation to an unseen hand (Zhang et al., 7 Oct 2025). GeoMatch++ improves zero-shot success on out-of-domain hands by 9.64% over prior methods (Wei et al., 2024).
- Scalability and Dataset Scale: CEDex demonstrates large-scale synthesis: 20 million grasp samples across 500K objects and four hands, synthesizing 64 grasps in 7.8 s, enabling downstream learning on massive, morphology-diverse datasets (Wu et al., 29 Sep 2025).
- Real-World Transfer: Methods validated in physical trials include CrossDex, ACE-F, and UniMorphGrasp, with real-world success rates of 80–91% on YCB and other objects (Yuan et al., 2024, Yan et al., 25 Nov 2025, Wu et al., 31 Jan 2026).
| Approach | Success (sim/real) | Inference Time (s) | Zero-Shot Gen. |
|---|---|---|---|
| UniMorphGrasp (Wu et al., 31 Jan 2026) | 94.0% / 91% | 0.47 | Yes (80–98%) |
| CrossDex (Yuan et al., 2024) | 80–88.5% / — | — | Yes (35.2–39.1%) |
| T(R,O) Grasp (Fei et al., 14 Oct 2025) | 94.8% / 90–91% | 0.21 | Yes (>70%) |
| AnyDexGrasp (Fang et al., 23 Feb 2025) | up to 98% / — | — | Yes (multiple hands) |
| GeoMatch++ (Wei et al., 2024) | 71.7% / — | — | Yes (+22.5% w/ morph.) |
5. Morphology-Aware Losses, Limitations, and Design Implications
- Morphology Encoding: Hierarchical, structure-aware encodings—such as node masks, descendant counts, or graph encoding distances—significantly improve per-joint accuracy, grasp stability, and zero-shot transfer to novel designs (Wu et al., 31 Jan 2026, Wei et al., 2024).
- Physical Realizability and Failure Modes: Failures arise from insufficient morphology diversity in training (limiting generalization), visual occlusion of key contact points (for vision-based policies), and retargeting mismatches, which are most pronounced for thin objects and strongly non-anthropomorphic hands (Yuan et al., 2024, Wu et al., 31 Jan 2026). Physics constraints, while computationally efficient, can permit hand-object penetration on thin geometries if not strictly enforced (Wu et al., 31 Jan 2026, Wu et al., 29 Sep 2025).
- Role of Human-Like Contacts: Embedding human grasp priors via CVAEs and merging strategies, as in CEDex, improves both kinematic plausibility and optimization speed (Wu et al., 29 Sep 2025).
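The hierarchical, structure-aware weighting mentioned above can be sketched via descendant counts over the kinematic tree: proximal joints influence more of the chain, so their supervision is weighted more heavily. The toy tree and the `1 + #descendants` weighting rule are illustrative assumptions, not the exact scheme of any cited method.

```python
# Toy kinematic tree: each joint maps to its parent (None for the root).
PARENT = {"wrist": None, "mcp": "wrist", "pip": "mcp", "dip": "pip",
          "thumb_cmc": "wrist"}

def descendant_counts(parent):
    """Count descendants of each joint by walking each joint's ancestor chain."""
    counts = {j: 0 for j in parent}
    for j in parent:
        p = parent[j]
        while p is not None:
            counts[p] += 1
            p = parent[p]
    return counts

def joint_loss_weights(parent):
    """Per-joint loss weights: 1 + #descendants, normalized to sum to one."""
    counts = descendant_counts(parent)
    raw = {j: 1.0 + c for j, c in counts.items()}
    total = sum(raw.values())
    return {j: w / total for j, w in raw.items()}
```

Under this rule the wrist (four descendants) receives five times the weight of a distal joint, mirroring the intuition that errors high in the tree displace every downstream fingertip.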
6. Extensions: Teleoperation, Closed-Loop Control, and Co-Design
- Teleoperation and Data Collection: Portable, force-feedback teleoperation systems, such as ACE-F, enable high-quality, morphologically-agnostic demonstrations without per-embodiment sensor redesign. Virtual force feedback enables direct data acquisition and intuitive control (Yan et al., 25 Nov 2025).
- Closed-Loop and Foundation Models: Conditioned diffusion models are capable of real-time, closed-loop operation (5 Hz) in dynamic environments. Foundation models provide user prompt–to–grasp pipelines via language and vision, supporting high-level semantic grasping across embodiments (Wei et al., 27 Oct 2025, Fei et al., 14 Oct 2025).
- Co-Design Frameworks: Integrated design+control optimization pipelines, exemplified by cross-embodied co-design (Fay et al., 3 Dec 2025), jointly search over modular hand architectures and cross-embodied policies, enabling rapid fabrication (sub-24h) and policy deployment without RL retraining.
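A recurring implementation detail behind such cross-embodied policies is the fixed-size, masked interface between one policy and many hands: observations are zero-padded to a shared layout with a validity mask, and actions beyond a hand's actuator count are dropped. The dimension and layout below are illustrative assumptions.

```python
MAX_DOF = 12  # the policy's fixed action/observation dimension across hands

def embed_observation(joint_pos, max_dof=MAX_DOF):
    """Zero-pad a hand's joint state into the shared layout, plus a
    validity mask so the policy can tell real joints from padding."""
    pad = max_dof - len(joint_pos)
    padded = list(joint_pos) + [0.0] * pad
    mask = [1.0] * len(joint_pos) + [0.0] * pad
    return padded, mask

def apply_action(policy_action, hand_dof):
    """Mask a fixed-size policy action down to this hand's actuators."""
    assert hand_dof <= MAX_DOF
    return policy_action[:hand_dof]
```

Because the interface is fixed, a newly fabricated hand design only needs its DoF count and padding mask, with no change to the policy network itself.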
7. Outlook and Open Challenges
Despite major advances, cross-embodiment dexterous grasping remains constrained by the limited diversity of training morphologies, the challenges of real-world deployment (e.g., limited tactile sensing, extreme object shapes), and the computational cost of high-dimensional optimization or diffusion inference for highly complex hands. Promising directions include:
- Incorporation of richer human demonstration priors, multi-object contact modeling, and tactile feedback signals.
- Adapting models to soft or continuum morphologies via generalized or learned topological merging and graph representations.
- Extending single-grasp pipelines to dynamic, functional, and multi-stage dexterous manipulations (e.g., in-hand reorientation, handover).
- Scaling foundation-model approaches to offer universal semantic-to-executable grasping across vast morphological and object-task diversity.
Recent results establish cross-embodiment dexterous grasping as a tractable and rapidly advancing domain, with state-of-the-art models now demonstrating robust zero-shot transfer, rapid inference, scalable dataset construction, and closed-loop deployment in both simulation and hardware (Yuan et al., 2024, Fang et al., 23 Feb 2025, Fei et al., 14 Oct 2025, Wu et al., 31 Jan 2026, Fay et al., 3 Dec 2025).