X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

Published 11 May 2025 in cs.RO, cs.AI, and cs.LG | (2505.07096v4)

Abstract: Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.

Abstract PDF Upgrade to Chat

Summary

The paper introduces X-Sim, a framework that trains robot manipulation policies from human videos by leveraging object-centric rewards in photorealistic simulations.
It employs 2D Gaussian Splatting and online domain adaptation to transfer policies from simulation to real-world environments efficiently.
Experiments demonstrate a 30% improvement over baselines and 90% task success in low-data settings, underscoring robust cross-embodiment learning.

"X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real" (arXiv ID: (2505.07096))

Introduction

The paper introduces X-Sim, a framework designed to train robot manipulation policies using human videos, without the need for explicit action labels. Current approaches in imitation learning require paired human-robot demonstrations or attempt to directly translate human movements to robots, often failing due to embodiment mismatches. X-Sim leverages a real-to-sim-to-real methodology by generating photorealistic simulations from human video data to train RL policies. This paper focuses on utilizing object motion as a transferable supervisory signal, thus enabling robots to achieve similar tasks even when following different trajectories than humans.

Figure 1: X-Sim framework overview demonstrating the real-to-sim-to-real approach for policy learning from human videos.

Methodology

Real-to-Sim Transfer

The key innovation of X-Sim lies in reconstructing simulation environments from RGBD videos to track object states and define object-centric rewards. By extracting high-fidelity object meshes and tracking their trajectories over the video, it employs 2D Gaussian Splatting to construct accurate simulation environments. This procedure allows Q-learning agents to mimic the outcomes of the human demonstrations by utilizing object-centric rewards based on the trajectory divergence between the RL policy and the human demonstration.

Figure 2: The real-to-sim process captures object motion from RGBD videos and constructs a simulation environment to train RL policies.

Sim-to-Real Policy Transfer

Once the privileged-state RL policies are trained, X-Sim generates a synthetic dataset of image-action pairs. The approach involves rendering diversified scenes by varying object configurations and viewpoints. Subsequently, it distills these policies into an image-conditioned diffusion network that can be deployed in real-world scenarios. Sim-to-real transfer is achieved via online domain adaptation, comparing real and simulated rollouts to iteratively minimize the sim-to-real visual gap during policy execution.

Figure 3: Demonstration of the sim-to-real transfer process with domain adaptation minimizing observation discrepancies.

Experimental Results

Extensive evaluations demonstrate X-Sim's capability to translate policies from simulation to real-world scenarios effectively. In tests across five different manipulation tasks, X-Sim consistently outperformed benchmarks, achieving a 30% improvement over traditional hand-tracking and baseline sim-to-real approaches. The diffusion policy was particularly impactful in low-data settings, showing effective generalization with 90% task success using only a fraction of the data compared to traditional teleoperation.

Figure 4: Task performance metrics showing the average task progress of X-Sim against baseline methods.

Discussion

Data Efficiency and Robustness

The method outpaces behavior cloning in terms of data efficiency, reducing the need for teleoperator data by 10 times while matching performance. The framework’s synthetic data strategy efficiently addresses initial state distribution broadening, providing robustness to unseen tasks and novel camera viewpoints.

Figure 5: Visualization of task initial state distributions used for data efficiency assessment.

Future Work and Limitations

While demonstrating strong results, X-Sim currently relies on 3D object meshes for tracking, which limits deployment in unstructured environments. Future developments could focus on combining prior models with the framework for seamless adaptation to dynamic scenes, articulations, and deformables. Real-to-sim-to-real calibration refinement remains essential in minimizing real-world observational discrepancies.

Conclusion

X-Sim advances the field of robot policy learning by introducing a scalable, data-efficient platform capable of leveraging human videos as rich sources of information for cross-embodiment learning. Its innovations in simulation and domain adaptation provide a promising direction for incorporating less structured, real-world data, reducing dependency on extensive robot teleoperation. Such capabilities have the potential to significantly accelerate the development of adaptable, large-scale robot models.