Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration

Published 17 Apr 2025 in cs.RO and cs.AI | (2504.12609v2)

Abstract: Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, a process that is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels from videos and morphological differences between robot and human hands. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method utilizes reinforcement learning (RL) in simulation to cross the human-robot embodiment gap without relying on wearables, teleoperation, or large-scale data collection typically necessary for imitation learning methods. From the demonstration, we extract two task-specific components: (1) the object pose trajectory to define an object-centric, embodiment-agnostic reward function, and (2) the pre-manipulation hand pose to initialize and guide exploration during RL training. We found that these two components are highly effective for learning the desired task, eliminating the need for task-specific reward shaping and tuning. We demonstrate that Human2Sim2Robot outperforms object-aware open-loop trajectory replay by 55% and imitation learning with data augmentation by 68% across grasping, non-prehensile manipulation, and multi-step tasks. Project Site: https://human2sim2robot.github.io


Summary

The paper introduces Human2Sim2Robot, a novel sim-to-real RL framework that bridges the human-robot embodiment gap for dexterous manipulation using only a single RGB-D video demonstration.
Its core methodology leverages object pose trajectory for dense, embodiment-agnostic rewards and pre-manipulation hand pose for efficient exploration during RL training.
Experimental results show significant performance gains over baselines and successful zero-shot sim-to-real transfer on a physical robot without task-specific reward tuning.


The paper introduces Human2Sim2Robot, an approach to training dexterous manipulation policies with reinforcement learning (RL) from a single RGB-D video of a human demonstration. The research addresses the significant morphological differences between human and robot hands, and it avoids a traditional hurdle of imitation learning (IL), which often requires extensive data collection through wearables or teleoperation.

Methodological Framework

At the core of this method is a real-to-sim-to-real framework that uses the information in a single video demonstration, without the labor-intensive data collection typical of IL frameworks. The approach relies on two task-specific elements extracted from the video: the object pose trajectory and the pre-manipulation hand pose.

Object Pose Trajectory: This trajectory defines a dense, embodiment-agnostic reward signal for learning the manipulation task in simulation. The reward depends on how the object moves rather than on precise replication of the human's hand motion, which leaves the robot free to find motions that suit its own morphology.
Pre-Manipulation Hand Pose: This component initializes policy training by providing advantageous starting states for exploration during RL. Unlike typical IL frameworks, which may struggle with embodiment disparities, the method provides a starting point compatible with the robot's operational parameters, so exploration is guided by the demonstration without being forced to mimic the human's exact actions.
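
The two components above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exponential form of the reward, the scale parameters, the Gaussian reset noise, and all function names (quat_angle, object_tracking_reward, sample_initial_hand_position) are assumptions chosen for illustration. It shows only the general idea that the reward is computed from the object pose alone, and that episodes start near the demonstrated pre-manipulation pose.

```python
import numpy as np


def quat_angle(q_a, q_b):
    """Smallest rotation angle in radians between two unit quaternions."""
    # abs() handles the double cover: q and -q encode the same rotation.
    dot = abs(float(np.dot(q_a, q_b)))
    return 2.0 * float(np.arccos(np.clip(dot, -1.0, 1.0)))


def object_tracking_reward(obj_pos, obj_quat, ref_pos, ref_quat,
                           pos_scale=10.0, rot_scale=1.0):
    """Reward of at most 1 that depends only on the object pose, not the hand.

    ref_pos / ref_quat are the demonstrated object pose at the current step.
    The exponential kernel and both scales are illustrative assumptions.
    """
    pos_err = float(np.linalg.norm(np.asarray(obj_pos) - np.asarray(ref_pos)))
    rot_err = quat_angle(np.asarray(obj_quat), np.asarray(ref_quat))
    return float(np.exp(-pos_scale * pos_err - rot_scale * rot_err))


def sample_initial_hand_position(pre_manip_pos, rng, noise_std=0.02):
    """Perturb the demonstrated pre-manipulation position for an episode reset.

    Gaussian noise with a hypothetical standard deviation (in meters) is an
    assumption; it stands in for whatever reset randomization is used.
    """
    return np.asarray(pre_manip_pos) + rng.normal(0.0, noise_std, size=3)


if __name__ == "__main__":
    identity = np.array([1.0, 0.0, 0.0, 0.0])  # (w, x, y, z) convention
    rng = np.random.default_rng(0)
    start = sample_initial_hand_position([0.4, 0.0, 0.2], rng)
    reward = object_tracking_reward([0.41, 0.0, 0.2], identity,
                                    [0.40, 0.0, 0.2], identity)
    print("reset position:", start, "reward:", reward)
```

Because the reward never references hand joints or fingertip positions, the same function would apply unchanged to a different robot hand, which is the sense in which such a reward is embodiment-agnostic.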

The paper demonstrates that Human2Sim2Robot outperforms object-aware open-loop trajectory replay by 55% and imitation learning with data augmentation by 68% across grasping, non-prehensile manipulation, and multi-step tasks.

Numerical and Experimental Insights

The experimental results underscore the system's robustness: the method requires no task-specific reward tuning. This streamlines the learning process and avoids the extensive task-specific engineering typically necessary in RL-based approaches. By deploying policies on a physical KUKA arm fitted with an Allegro hand, the researchers validate zero-shot sim-to-real transfer, showing that the policies work on hardware without further fine-tuning.

Theoretical and Practical Implications

From a theoretical perspective, this paper offers a framework for translating a high-dimensional human demonstration into a compact training signal for robot RL. Practically, it lowers the barriers to entry associated with data-intensive IL methods, presenting an accessible pathway for leveraging human demonstrations directly.

Future Directions

The research suggests several avenues for future work, such as extending the framework to bimanual manipulation or integrating similar methods into broader learning systems, including multi-task and multi-robot settings. Extending the approach to objects with complex, deformable, or articulated structures could also substantially broaden its applicability in diverse real-world environments.

In conclusion, Human2Sim2Robot stands as a significant contribution to the field of robot learning, offering a practical and theoretically sound framework for bridging the human-robot embodiment gap in dexterous manipulation tasks with minimal reliance on extensive datasets or task-specific reward shaping.
