RoboTwin2.0: Scalable Bimanual Simulation

Updated 3 February 2026
  • RoboTwin2.0 is a large-scale simulation framework that integrates expert data synthesis, structured domain randomization, and unified evaluation protocols for bimanual robotic manipulation.
  • It employs a closed-loop pipeline with multimodal LLM-driven task-code synthesis and simulation-in-the-loop refinement to enhance policy performance and sim-to-real transfer.
  • The platform benchmarks diverse dual-arm manipulation tasks with extensive expert trajectories, driving significant improvements in robotic policy success rates.

RoboTwin2.0 is a large-scale, domain-randomized simulation framework designed for scalable data generation, benchmarking, and transfer learning in robust bimanual robotic manipulation. It provides a closed-loop ecosystem that integrates multimodal expert data synthesis, structured domain randomization, and unified evaluation protocols across a diverse set of robot embodiments and manipulation tasks. As of its latest release, RoboTwin2.0 has been adopted as the canonical benchmark in high-profile competitions, systematic policy evaluation, and large-scale sim-to-real research (Chen et al., 22 Jun 2025, Chen et al., 29 Jun 2025, Li et al., 11 Sep 2025, Liang et al., 30 Nov 2025).

1. Core System Architecture and Data Generation Pipeline

RoboTwin2.0 is built as a closed-loop expert data generation platform structured in three key stages:

  1. Task-Code Synthesis: Given a natural language goal, a multimodal LLM (MLLM) programmatically generates high-level Python code by composing API calls from a standardized skill set (e.g., move_arm, grasp, place). This enables automated interpretation and translation of abstract task descriptions into executable robot behaviors.
  2. Simulation-in-the-Loop Refinement: The candidate program is executed multiple times ($N = 10$) in the simulator. A vision-LLM (VLM) observer diagnoses execution failures (e.g., API misuse, logic errors, grasp misalignments) and provides targeted feedback for program repair. The synthesis-refinement loop iterates up to five times, terminating once a 50% success rate is observed or the maximum number of iterations is reached. The formal termination condition is $R_\text{task} \geq 0.5$, where $R_\text{task} = (1/N)\sum_{i=1}^{N} R_i$ and each $R_i$ averages binary successes over $M$ executions (Chen et al., 22 Jun 2025).
  3. Trajectory Generation and Evaluation: Upon validation, the synthesized program is unrolled under domain randomization to generate 400 "hard" and 100 "clean" trajectories for each task–embodiment pair. This results in a dataset of expert demonstrations designed to capture broad variations and support generalizable policy learning.
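
The refinement loop in steps 1–2 can be sketched as follows. This is a minimal illustration, not RoboTwin2.0 code: `generate_task_code`, `run_in_sim`, and `vlm_diagnose` are hypothetical placeholders for the MLLM synthesizer, the simulator, and the VLM observer.

```python
# Hypothetical sketch of the closed-loop task-code refinement described above.
# The callables are placeholders, not actual RoboTwin2.0 APIs.

N = 10           # simulator executions per candidate program
MAX_ITERS = 5    # maximum synthesis-refinement iterations
THRESHOLD = 0.5  # terminate once R_task >= 0.5

def refine_task_code(goal, generate_task_code, run_in_sim, vlm_diagnose):
    """Iterate MLLM synthesis and VLM-guided repair until R_task >= 0.5."""
    feedback = None
    program, r_task = None, 0.0
    for _ in range(MAX_ITERS):
        program = generate_task_code(goal, feedback)       # MLLM synthesis
        results = [run_in_sim(program) for _ in range(N)]  # binary successes
        r_task = sum(results) / N                          # R_task estimate
        if r_task >= THRESHOLD:
            break                                          # validated program
        feedback = vlm_diagnose(program, results)          # VLM failure diagnosis
    return program, r_task
```

A validated program is then unrolled under domain randomization (step 3) to produce the expert trajectories.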

A central resource of RoboTwin2.0 is the RoboTwin-OD object library: 731 3D models across 147 categories (534 captured in-house, 153 from Objaverse, 44 from SAPIEN PartNet-Mobility), each annotated with 15 language descriptions, affordance keypoints, and intra-class similarity groupings. This library underpins task diversity, scene cluttering, and semantic robustness in simulation (Chen et al., 22 Jun 2025).

2. Structured Domain Randomization for Robustness

RoboTwin2.0 implements systematic domain randomization along five axes to promote sim-to-real transfer and policy robustness:

  • Scene Clutter: Distractors drawn from RoboTwin-OD, ensuring non-overlap with task objects, and spatially sampled with uniform pose distributions while avoiding collision.
  • Background Textures: 12,000 procedurally and generatively synthesized images, selected uniformly at random.
  • Lighting: For each light source, color temperature sampled $T_k \sim \text{Uniform}(2700\,\text{K}, 6500\,\text{K})$, intensity $I_k \sim \text{Uniform}(0.5, 2.0)$, and position $p_k$ sampled from a 1 m × 1 m × 0.5 m cuboid.
  • Table Height: $h_\text{table} \sim \text{Uniform}(0.7\,\text{m}, 0.84\,\text{m})$.
  • Language Instructions: Instantiated from a grammar of 60 templates combined with 15 descriptions per object, with random selection for each instantiation.

Each randomized realization sets $x^* = x_0 + \delta$, with $\delta$ drawn from a specified distribution per parameter. This approach forgoes adversarial objectives in favor of maximal, unbiased variability (Chen et al., 22 Jun 2025).
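
A single randomized scene draw along the visual axes above can be sketched as follows. The value ranges follow the paper; the dictionary field names, the number of light sources, and the cuboid's origin are illustrative assumptions.

```python
import random

def sample_domain_randomization():
    """Draw one randomized scene configuration per the distributions above.
    Ranges follow the paper; field names and the cuboid origin are illustrative."""
    lights = [{
        "temperature_K": random.uniform(2700.0, 6500.0),  # T_k
        "intensity": random.uniform(0.5, 2.0),            # I_k
        # p_k sampled from a 1 m x 1 m x 0.5 m cuboid (origin assumed)
        "position_m": (random.uniform(-0.5, 0.5),
                       random.uniform(-0.5, 0.5),
                       random.uniform(0.0, 0.5)),
    } for _ in range(random.randint(1, 3))]
    return {
        "lights": lights,
        "table_height_m": random.uniform(0.70, 0.84),    # h_table
        "background_texture": random.randrange(12000),   # index into texture bank
    }
```

Scene clutter and language instructions would be sampled analogously from RoboTwin-OD and the instruction grammar.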

3. Benchmark Tasks, Embodiments, and Evaluation Protocols

RoboTwin2.0 defines a suite of 50 dual-arm benchmark tasks spanning coordinated pick-and-place, tool use, articulated- and deformable-object manipulation, coordinated handovers, clutter resilience, and high-precision alignment. Representative tasks include "Adjust Bottle", "Beat Block Hammer", "Stack Blocks (Three)", "Handover Block", and "Rotate QR Code" (Chen et al., 22 Jun 2025).

Supported robot embodiments include Franka Emika Panda (7 DoF), UR5 (6 DoF), AgileX Piper (6 DoF), ARX-X5 (6 DoF), and Aloha-AgileX (6 DoF mobile manipulator), each with reachability-aware grasp adaptation policies. Object affordances and approach axes are annotated per embodiment (Chen et al., 22 Jun 2025).

For each embodiment and task, RoboTwin2.0 provides:

  • 100 clean trajectories (no randomization)
  • 400 hard trajectories (full randomization)

This totals 125,000 trajectories (50 tasks × 5 robots × 500 trajectories) with uniform semantic and physical coverage.

Evaluation employs binary task-level scoring: a completed task yields full points, partial progress yields zero (Simulation Round 2 rules) (Chen et al., 29 Jun 2025). Average success rates are computed over 100–1000 held-out scenes per task, reflecting stratified randomization (Liang et al., 30 Nov 2025).
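
Under this binary protocol, the reported average success rate is simply the mean of per-scene completion indicators. A minimal sketch (the helper name is hypothetical):

```python
def average_success_rate(scene_results):
    """Binary task-level scoring: each held-out scene contributes 1 for a
    completed task and 0 otherwise; partial progress scores zero."""
    if not scene_results:
        raise ValueError("no evaluation scenes")
    return sum(1 if done else 0 for done in scene_results) / len(scene_results)
```

For example, 42 completions out of 100 held-out scenes yields a 42% success rate regardless of how close the failed episodes came to completion.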

4. Technical Features and API Exposure

The platform utilizes a bidirectional "digital twin" paradigm, pairing real-world human demonstrations with high-fidelity simulated replicas. Key modules are pluggable:

  • Visual Rendering: Textured, randomized backgrounds and dynamic lighting.
  • Physics Simulation: Rigid-body (solved at 1000 Hz) and deformable-body (FEA, mesh subdivision).
  • Task and Instruction Pipelines: Language-conditioned, LLM-driven task parsing and program synthesis.
  • Data Collection: Unlimited self-play or demonstration replay, with Python API (example below):
    from robotwin2 import TwinEnv

    env = TwinEnv(
        seed=..., randomize_visual=True, randomize_physics=True,
        instruction="Place the phone stand"
    )
    obs = env.reset()
    for t in range(max_steps):
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        if done:
            break
  • Realtime Rendering & Inference: Single RTX 4090 GPU, batch parallelism, and robot-in-the-loop evaluation (Chen et al., 29 Jun 2025).

Physics randomization (unique to 2.0) includes friction $\mu \sim \text{Uniform}[0.3, 0.6]$, mass variation $\pm 10\%$, and joint damping perturbations (Chen et al., 29 Jun 2025).
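
These physics perturbations can be sampled per episode as in the sketch below. The friction and mass ranges follow the paper; the damping perturbation range and the function signature are illustrative assumptions.

```python
import random

def sample_physics_randomization(nominal_mass_kg, joint_damping):
    """Sample per-episode physics perturbations in the style described above.
    The +/-10% damping range is an assumption; the paper does not state it."""
    return {
        "friction": random.uniform(0.3, 0.6),                   # mu ~ U[0.3, 0.6]
        "mass_kg": nominal_mass_kg * random.uniform(0.9, 1.1),  # +/-10% mass
        "joint_damping": [d * random.uniform(0.9, 1.1)          # per-joint jitter
                          for d in joint_damping],
    }
```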

5. Empirical Outcomes and Policy Generalization

The automated, closed-loop MLLM+VLM data generator achieves a +10.9% absolute gain in task-code success rate over RoboTwin 1.0 (60.4%→71.3% ASR) (Chen et al., 22 Jun 2025). Fine-tuning vision-language-action (VLA) policies on 9,600 domain-randomized trajectories yields a +71.6% relative improvement in zero-shot success (14.8%→25.4%) and a +41.9% increase for the $\pi_0$ baseline (21.0%→29.8%) on five held-out tasks. In real-world sim-to-real experiments, augmenting from 10 clean demos to 1,000 domain-randomized trajectories raised task success from 9.0% to 42.0% (absolute +33.0%, relative +366.7%). Synthetic-only policies achieved 29.5% (relative +228%) (Chen et al., 22 Jun 2025).
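
As a quick sanity check on how the absolute and relative figures above relate, both gains follow from the before/after success rates (the helper name is hypothetical):

```python
def gains(before_pct, after_pct):
    """Absolute gain in percentage points and relative gain in percent."""
    absolute = after_pct - before_pct
    relative = 100.0 * (after_pct - before_pct) / before_pct
    return absolute, relative

# e.g. the zero-shot VLA numbers above: 14.8% -> 25.4%
# gives roughly +10.6 points absolute and +71.6% relative.
```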

In the RoboTwin Dual-Arm Collaboration Challenge (CVPR 2025), top teams consistently approached perfect scores in domain-randomized six-task tracks, demonstrating the benchmark’s ability to drive generalizable bimanual policy learning (Chen et al., 29 Jun 2025).

Recent frameworks (e.g., MM-ACT, SimpleVLA-RL) report average bimanual task success rates of 52.38% (+9.25% from cross-modal learning) and 68.8%, substantially exceeding supervised and prior RL baselines in domain-randomized RoboTwin2.0 settings (Liang et al., 30 Nov 2025, Li et al., 11 Sep 2025).

6. Comparative Experimental Insights

Success rates of selected models on eight RoboTwin2.0 tasks (Liang et al., 30 Nov 2025):

    Model               Avg Success Rate (%)
    ------------------  --------------------
    $\pi_0$ baseline    48.13
    OpenVLA-OFT         23.13
    MM-ACT (Vanilla)    43.13
    MM-ACT (+Text)      46.50
    MM-ACT (+Image)     48.75
    MM-ACT (+T+I)       52.38

RL-based improvements (12 tasks, 100 rollouts per task) (Li et al., 11 Sep 2025):

    Model         Short (%)  Med (%)  Long+XL (%)  Overall (%)
    ------------  ---------  -------  -----------  ----------
    $\pi_0$       45.5       58.8     43.3         49.2
    OpenVLA-OFT   21.3       47.1     46.5         38.3
    +RL (ours)    64.9       72.5     69.0         68.8

This suggests that reinforcement learning, group-relative normalization for binary rewards, and cross-modal decoding each provide substantial, complementary boosts in bimanual manipulation under strong domain shift.

7. Limitations, Extensions, and Open Resources

Limitations:

  • Manual 3D scene modeling induces object placement errors on the order of a few centimeters.
  • No explicit learning-based robustness metric is adopted; all evaluations use binary task success (Chen et al., 22 Jun 2025).

Future Directions:

  • Automatic environment capture via 3D LiDAR scanning and photogrammetry.
  • Online visual-inertial SLAM for dynamic registration.
  • Expansion to cooperative multi-robot systems and industrial-scale environments.

Collectively, RoboTwin2.0 constitutes a foundational infrastructure for learning and robust evaluation in bimanual manipulation, supporting rapid progression in sim-to-real transfer, multimodal policy architectures, and standardized manipulation research.
