RLBench: Robot Learning Benchmark
- RLBench is a simulation suite comprising 100+ hand-crafted tasks designed to evaluate vision-based robotic manipulation.
- It integrates diverse sensory inputs—including RGB, depth, segmentation, and proprioception—to support deterministic and reproducible experiments.
- Its Python API enables seamless plug-in of reinforcement, imitation, and meta-learning pipelines for scalable, data-driven robotics research.
RLBench (Robot Learning Benchmark and Learning Environment) is a large-scale, extensible simulation suite designed to advance research in vision-based robotic manipulation. It provides a diverse set of over 100 hand-crafted tasks with varying complexity, a comprehensive set of sensory modalities (RGB, depth, and segmentation from multiple cameras; proprioceptive state), and an efficient mechanism for generating and validating expert demonstrations based on motion planners. RLBench has become a canonical testbed for data-driven policy learning in robotic manipulation, supporting reinforcement learning (RL), imitation learning (IL), multi-task/meta-learning, geometric policy pipelines, and hybrid architectures (James et al., 2019).
1. Formal Structure and API
Each RLBench task is formulated as an episodic Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, R)$, where the state consists of joint positions, velocities, torques, end-effector 6D pose, and visual observations from two well-defined camera setups: "over-the-shoulder" stereo and "eye-in-hand" monocular. Action spaces are user-selectable and include joint positions, velocities, torques, or end-effector translations/orientations, with both absolute and delta modes supported. The benchmark operates atop the CoppeliaSim (formerly V-REP) simulator using PyRep, enabling deterministic transition dynamics and flexible episode randomization (James et al., 2019).
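The absolute/delta distinction amounts to whether an action is interpreted as a target value or as an offset from the current state. A toy 1-line conversion illustrates it (the helper name `resolve_action` is illustrative, not part of the RLBench API):

```python
import numpy as np

def resolve_action(current_joints, action, mode="delta"):
    """Map an agent action to an absolute joint-position target.

    mode="abs":   the action *is* the target joint configuration.
    mode="delta": the action is an offset added to the current joints.
    """
    action = np.asarray(action, dtype=float)
    if mode == "abs":
        return action
    if mode == "delta":
        return np.asarray(current_joints, dtype=float) + action
    raise ValueError(f"unknown mode: {mode}")

current = np.zeros(7)  # 7-DoF arm at its home configuration
target = resolve_action(current, 0.1 * np.ones(7), mode="delta")
# target is now [0.1, 0.1, ..., 0.1]
```

The same pattern applies to velocity, torque, and end-effector-pose action modes; only the quantity being targeted or offset changes.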
The environment supports reproducible experiment scripting via a Python API:
```python
from rlbench.environment import Environment
from rlbench.action_modes import ActionMode

env = Environment('/path/to/demos', ActionMode.ABS_JOINT_VELOCITY)
env.launch()
task = env.sample_task()
demos = task.get_demos(5)
descriptions, obs = task.reset()
obs, r, done = task.step(action)
env.shutdown()
```
The implementation enables seamless integration with RL and IL pipelines, providing programmatic access to scene elements, procedural variation, and large-scale demonstration data.
2. Task Diversity, Demonstrations, and Benchmarks
RLBench comprises 100+ rigorously specified tasks spanning basic kinematic goals (e.g., reach target), contact-rich multi-stage activities (open door, empty dishwasher), and long-horizon sequential manipulation (insert peg, stack blocks, put books on shelf) (James et al., 2019). Each task supports infinite demonstration generation by motion planning through pre-specified waypoints and IK, guaranteeing collision-free trajectories. This capacity supports data-driven learning even in the absence of human teleoperation.
Task success is always defined via a programmatic predicate evaluated at the episode's terminal state; reward is typically sparse, 1 on task completion and 0 otherwise, with optional intermediate reward toggles for shaping, as in modern policy evaluation (James et al., 2019, Chen et al., 11 Jan 2025). Evaluation protocols include multi-episode success rate, with explicitly defined few-shot splits (90 train, 10 test tasks, and a fixed budget of demos per unseen task) enabling rigorous benchmarking of generalization and meta-learning algorithms.
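The predicate-plus-sparse-reward pattern can be sketched in a few lines (the distance-threshold predicate and names here are illustrative, not RLBench's internal checks):

```python
import numpy as np

def success(obj_pos, goal_pos, tol=0.02):
    """Programmatic success predicate: object within `tol` metres of goal."""
    return float(np.linalg.norm(np.asarray(obj_pos) - np.asarray(goal_pos))) < tol

def sparse_reward(obj_pos, goal_pos, done):
    """Binary terminal reward: 1.0 only if the episode ends in success."""
    return 1.0 if (done and success(obj_pos, goal_pos)) else 0.0

# Object 1 cm from goal at episode end -> success, reward 1.0
r_hit = sparse_reward([0.5, 0.0, 0.10], [0.5, 0.0, 0.11], done=True)
# Object 40 cm away -> failure, reward 0.0
r_miss = sparse_reward([0.5, 0.0, 0.10], [0.9, 0.0, 0.10], done=True)
```

Real tasks compose several such predicates (grasped, lifted, placed), but each remains a boolean function of the simulator state, which is what makes automated large-scale evaluation possible.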
3. Sensory Modalities and Observational Fidelity
RLBench's observation model supports rich, multi-view sensory input for high-fidelity manipulation. The standard visual suite includes:
- Over-the-shoulder (stereo): Two fixed RGB cameras (with synchronized depth and segmentation masks), facilitating stereo geometry and occlusion-robust policies.
- Eye-in-hand (monocular): One wrist-mounted RGB-D camera, providing peripersonal observation for fine manipulation.
All images are available in configurable pixel resolutions (typically 128×128–640×480), with pixelwise segmentation by object instance/mask for vision models. Proprioception includes joint angles, velocities, torques, and gripper state, supporting hybrid perception models (James et al., 2019, Chen et al., 11 Jan 2025, Huang et al., 2024). Modern approaches fuse these streams via point-cloud backprojection (Chen et al., 2023, Chen et al., 18 Dec 2025) or multi-view renderings (Wang et al., 2024) for action prediction.
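The point-cloud backprojection these methods rely on is the standard pinhole-camera inversion of each depth pixel; a minimal sketch, assuming known intrinsics (fx, fy, cx, cy):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth image (H, W) in metres to an (H*W, 3) camera-frame cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
    z = depth
    x = (u - cx) * z / fx                           # invert the pinhole model
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# 2x2 toy depth map at 1 m, unit focal length, principal point at pixel (0, 0)
cloud = backproject(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Clouds from multiple calibrated cameras are then transformed into a common world frame and merged, which is the representation consumed by the point-cloud policies cited above.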
Ablation studies across methods repeatedly confirm that multi-view coverage substantially outperforms single-view—for example, success rates in PolarNet scale from 35–48% (single view) to 92% (all views) on 10 benchmark tasks (Chen et al., 2023).
4. Algorithmic Approaches and Comparative Results
RLBench underpins the evaluation of a broad spectrum of policy learning algorithms:
(a) Value-Based RL (Coarse-to-Fine)
State-of-the-art value-based methods, notably the Coarse-to-fine Q-Network (CQN) and its action-sequence variant CQN-AS (Seo et al., 2024), discretize the action space hierarchically: at each level, the critic selects among a small set of candidate bins per dimension, progressively zooming into a fine-grained region. This procedure supports sample-efficient RL with only a few levels and bins per level, enabling robust delta-joint control in sparse-reward settings. CQN attains high average success over 20 RLBench tasks within a modest budget of environment steps, dramatically outperforming DrQ-v2+ (40%) and demonstration-driven BC baselines (below 50%) (Seo et al., 2024). CQN-AS further increases data efficiency and performance on long-horizon tasks by predicting multi-step action sequences, yielding 20–40 point gains over CQN on the most challenging manipulations (Seo et al., 2024).
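The coarse-to-fine zooming is easy to state concretely: at each level, the interval containing the best-scoring bin becomes the next level's interval. A toy 1-D version, with a stand-in scoring function in place of the learned critic:

```python
import numpy as np

def coarse_to_fine(score_fn, low=-1.0, high=1.0, levels=3, bins=5):
    """Hierarchically discretize [low, high]: at each level, score the bin
    centres, keep the best bin, and recurse into that bin's sub-interval."""
    for _ in range(levels):
        edges = np.linspace(low, high, bins + 1)
        centres = (edges[:-1] + edges[1:]) / 2.0
        best = int(np.argmax([score_fn(c) for c in centres]))
        low, high = edges[best], edges[best + 1]  # zoom into the best bin
    return (low + high) / 2.0                     # final fine-grained action

# Stand-in "critic": prefers actions near 0.3
action = coarse_to_fine(lambda a: -abs(a - 0.3))
```

With L levels and B bins the effective resolution is B^L while only L·B values are ever scored per dimension, which is the source of the method's sample efficiency.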
(b) Autoregressive Transformers
Autoregressive Policy (ARP) (Zhang et al., 2024), built atop the Chunking Causal Transformer (CCT), models hybrid action sequences and supports variable “chunk” sizes for continuous and discrete components (e.g., 2D waypoints, gripper flag). On 18 RLBench tasks, ARP achieves the highest average success, outperforming the prior RVT-2, with an enlarged variant of ARP improving further. Notably, ARP matches or exceeds the previous SoTA across the most challenging tasks, including a near tripling of peg-insertion success relative to RVT-2 (Zhang et al., 2024).
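Chunked autoregressive decoding differs from token-by-token decoding only in emitting several action components jointly per step, each chunk conditioned on everything emitted so far. A schematic loop with a dummy predictor (the function names are illustrative, not the ARP codebase):

```python
def decode_chunks(model, chunk_sizes, seed):
    """Autoregressively emit one chunk at a time; each chunk may mix
    continuous and discrete components and is conditioned on all
    previously emitted values, as in a chunking causal transformer."""
    history = list(seed)
    out = []
    for size in chunk_sizes:
        chunk = model(history, size)  # predict `size` values jointly
        out.append(chunk)
        history.extend(chunk)         # condition the next chunk on this one
    return out

# Dummy "model": the next chunk repeats the last value in the history
dummy = lambda hist, size: [hist[-1]] * size
chunks = decode_chunks(dummy, chunk_sizes=[2, 1, 3], seed=[0.5])
# chunks == [[0.5, 0.5], [0.5], [0.5, 0.5, 0.5]]
```

Varying `chunk_sizes` per action component (e.g., 2 for a planar waypoint, 1 for a gripper flag) is the mechanism that lets one decoder handle heterogeneous action spaces.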
(c) Geometric and Registration Pipelines
Match Policy (Huang et al., 2024) employs a non-learning, point-cloud registration approach for keyframe-based manipulation. Each manipulation is converted into a rigid registration problem between current segmented object clouds and key-poses from demonstration. Leveraging equivariances, Match Policy achieves high sample efficiency: on 6 RLBench tasks, even a single demonstration yields near-optimal completion (e.g., on the Phone-on-Base task), matching or exceeding learned baselines on challenging, high-precision cases (Huang et al., 2024).
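The rigid-registration step at the core of such pipelines is, for known correspondences, the classical Kabsch/SVD alignment; a self-contained sketch (ICP-style correspondence search, which a full pipeline also needs, is omitted):

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) aligning src to dst,
    given one-to-one correspondences of shape (N, 3) -> (N, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Sanity check: recover a known 90-degree rotation about z plus a translation
theta = np.pi / 2
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
src = np.random.default_rng(0).normal(size=(10, 3))
dst = src @ R_true.T + np.array([0.1, 0.2, 0.3])
R, t = kabsch(src, dst)
```

Because the solution is exact for noise-free correspondences, a single demonstration keyframe can define the target pose, which is why such geometric pipelines are so sample-efficient.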
(d) Multi-View World Models and LLM-Driven Pipelines
RoboHorizon (Chen et al., 11 Jan 2025) demonstrates the leading edge in integrating LLMs (GPT-4o) for both dense reward synthesis and multi-stage plan decomposition. Its Recognize-Sense-Plan-Act (RSPA) architecture leverages LLM-generated dense rewards at per-subtask granularity, keyframe discovery via velocity-based heuristics, and planning within a learned MV-MAE + RSSM world model using DreamerV2 optimization. Evaluated on 10 RLBench tasks (4 short, 6 long-horizon), RoboHorizon achieves success rates from $73.9$% upward on short-horizon tasks and from $31.2$% upward on complex, multi-stage manipulations, outperforming the strongest baselines by a clear absolute margin (averaged over task groups) (Chen et al., 11 Jan 2025).
(e) Foundation Model–Guided and Virtual-Eye Approaches
Recent works leverage foundation models for spatial reasoning and depth-aware viewpoints:
- VIHE (Wang et al., 2024) autoregressively renders virtual in-hand views conditioned on intermediate action predictions, yielding strong average success over 18 RLBench tasks with 100 demonstrations, an improvement of several points versus the previous SOTA (Wang et al., 2024).
- VERM (Chen et al., 18 Dec 2025) invokes GPT-4o to select a single, task-adaptive orthographic view from the merged point cloud, drastically reducing input redundancy and occlusion compared to multi-camera raw fusion. VERM’s dynamic coarse-to-fine module further boosts precision only when needed. On 17 RLBench tasks, VERM achieves a leading average success rate, with substantial training and inference speedups over RVT-2. Each component (virtual camera, C2F, resolution, zoom) is ablated and shown to be essential for SOTA performance (Chen et al., 18 Dec 2025).
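Rendering an orthographic view from a merged point cloud reduces to dropping the view axis and rasterizing onto a grid. A toy top-down height-map rendering (grid bounds and resolution are illustrative assumptions, not VERM's parameters):

```python
import numpy as np

def orthographic_topdown(points, bounds=(-0.5, 0.5), res=4):
    """Project an (N, 3) point cloud to a top-down orthographic height map:
    x/y index the image grid, z becomes the pixel value (highest point wins)."""
    lo, hi = bounds
    img = np.full((res, res), -np.inf)
    # map x, y into integer grid cells in [0, res)
    ij = ((points[:, :2] - lo) / (hi - lo) * res).astype(int)
    keep = ((ij >= 0) & (ij < res)).all(axis=1)  # drop out-of-bounds points
    for (i, j), z in zip(ij[keep], points[keep, 2]):
        img[j, i] = max(img[j, i], z)            # keep the highest point
    return img

pts = np.array([[0.0, 0.0, 0.2],   # two points in the same cell...
                [0.0, 0.0, 0.5],   # ...the higher one survives
                [-0.4, 0.4, 0.1]])
height = orthographic_topdown(pts)
```

Selecting the projection axis (here fixed to top-down) is exactly the degree of freedom the LLM-driven viewpoint selection exploits.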
(f) 3D Point Cloud and Multimodal Fusion Policies
PolarNet (Chen et al., 2023) fuses multi-view 3D point clouds with language via a multimodal Transformer. It achieves strong success rates in single-task, large-scale (74-task), and multi-variation (18-task) RLBench settings, consistently outperforming both 2D and voxel-based policies. Multi-view ablation shows a substantial deficit when dropping to a single camera (Chen et al., 2023).
5. Grasp-Centric and Correction Modules
Analysis of learned policies on RLBench reveals that unstable or imprecise grasps are a principal bottleneck even in strong policies. GraspCorrect (Lee et al., 19 Mar 2025) introduces a plug-and-play correction module employing VLM-guided (ChatGPT-4o) iterative visual prompting, object-aware sampling, and goal-conditioned BC-based action denoising. Inserted at the grasp execution step, it increases average task completion on 18 RLBench tasks by $5.5$–$18.3$ points across multiple policy models and raises success on challenging tasks such as insert peg. Ablations confirm that both grasp-guided prompting and constraint-enforced sampling are essential (Lee et al., 19 Mar 2025).
6. Sample Efficiency, Real-World Transfer, and Limitations
Nearly all high-performing methods now demonstrate both high sample efficiency and some form of real-world transfer:
- C2F-ARM (James et al., 2021) and CQN attain high simulated success and real-world convergence within minutes of training or a few tens of demonstrations.
- VIHE and PolarNet achieve top simulation scores and transfer to real robots with only $10$–$20$ physical demonstrations per task, showing comparable or modestly reduced success rates.
- Match Policy, due to its geometric equivariance, attains reliable completion on unstructured real tasks from only $10$ kinesthetic demos (Huang et al., 2024).
- VERM’s architecture, although evaluated primarily in simulation, is structured to facilitate efficient inference and sim-to-real adaptation (Chen et al., 18 Dec 2025).
Known limitations are reported in occlusions, sparsity of RGB+D for contact-rich or deformable objects, reliance on precise segmentation for geometry-based methods, and the need for robust closed-loop feedback in dynamic or high-precision assembly (Chen et al., 2023, Huang et al., 2024, Lee et al., 19 Mar 2025).
7. Contributions and Impact on Robotic Manipulation Research
RLBench has accelerated progress in vision-based manipulation by enforcing standardization of (i) observation and action conventions, (ii) large-scale, reproducible benchmarking across task-matched splits, and (iii) the separation of demonstration, RL, and meta-learning protocols. It supports fair head-to-head evaluation (e.g., RVT/ARP, CQN/CQN-AS) and drives developments in:
- Scalable multi-task learning (74+ tasks)
- Long-horizon sequential planning with sparse success
- Fusion of language, vision, and proprioception
- Integration of LLMs/foundation models for reward supervision and viewpoint selection
Modern methods frequently report RLBench results as primary evidence of performance and generalization, and the benchmark remains a central reference for manipulation policy evaluation (Chen et al., 11 Jan 2025, Zhang et al., 2024, Chen et al., 18 Dec 2025, Chen et al., 2023). Its extensible API and validation toolchain enable open community contributions and scaling to new manipulation domains.