
MetaWorld: Robotic Manipulation Suite

Updated 28 January 2026
  • MetaWorld is a comprehensive benchmark suite of 50 distinct robotic manipulation tasks using a 7-DOF Sawyer arm.
  • It underpins research in meta-reinforcement learning, imitation learning, and generalist policies with few-shot adaptation and systematic evaluation.
  • Rigorous protocols and structured RL frameworks in MetaWorld foster advances in hierarchical architectures, multi-modal conditioning, and continual learning.

MetaWorld is a comprehensive suite of robotic manipulation benchmarks that has become a central testbed for research in meta-reinforcement learning, imitation learning, skill discovery, preference-based RL, diffusion policies, continual learning, and embodiment-agnostic planning across diverse manipulation tasks. Its design emphasizes task diversity, few-shot adaptation, systematic evaluation, and reproducibility, supporting rigorous development and comparison of generalist robotic agents.

1. Benchmark Structure and Task Protocols

MetaWorld consists of 50 distinct tabletop manipulation tasks for the 7-DOF Sawyer arm, spanning skills such as reaching, pushing, drawer opening/closing, peg insertion, door manipulation, button pressing, basketball shooting, sweeping, and shelf placement. Environments furnish both dense shaped rewards (e.g., negative distance-to-goal plus a sparse endpoint success signal) and binary success metrics, with typical state observations comprising joint positions and velocities, end-effector kinematics, object positions, and optionally images. The most frequently studied splits are:

  • ML10: 10 tasks in total, with 8 used for meta-training and 5 held out for meta-testing (3 tasks overlap between the splits).
  • MT10/MT50: 10/50 tasks for multi-task learning, with randomization over object/goal positions.
  • ML45/ML05: 45-task train/5-task OOD split used for generalist and few-shot evaluation (Wei et al., 2023).
  • MetaWorld-v2: Supports all 50 tasks and is often the basis for modern structured skill-discovery, continual learning, and diffusion-policy studies.

Table: ML10 Split (as in (Atamuradov, 15 Nov 2025))

| Meta-Train Tasks     | Meta-Test Tasks |
|----------------------|-----------------|
| button-press-topdown | door-close      |
| drawer-close         | drawer-open     |
| door-open            | lever-pull      |
| peg-insert-side      | shelf-place     |
| reach                | sweep-into      |
| sweep                |                 |
| basketball           |                 |
| window-open          |                 |
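
The reward structure described above — dense negative distance-to-goal plus a sparse success signal — can be sketched in a few lines. The success radius and bonus values below are illustrative assumptions; MetaWorld's actual rewards are task-specific and include additional grasp and placement terms.

```python
import math

def shaped_reward(ee_pos, goal_pos, success_radius=0.05, success_bonus=1.0):
    """Dense shaped reward plus a binary success metric.

    Illustrative sketch only: the radius and bonus are hypothetical
    parameters, not MetaWorld's per-task reward shaping.
    """
    dist = math.dist(ee_pos, goal_pos)   # Euclidean distance to goal
    success = dist < success_radius      # binary success signal
    reward = -dist + (success_bonus if success else 0.0)
    return reward, success
```

Policies trained on the dense term get a smooth gradient toward the goal, while the binary flag is what the benchmark's success-rate metrics report.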

2. Meta-Learning and Fast Adaptation

Robust few-shot adaptation is a primary focus. Model-Agnostic Meta-Learning (MAML), evaluated with Trust Region Policy Optimization (TRPO), seeks an initialization $\theta$ that enables rapid per-task adaptation through one or a few policy-gradient steps (Atamuradov, 15 Nov 2025). The meta-objective for reinforcement learning tasks is to maximize post-adaptation expected return:

Inner loop: $\theta'_i = \theta - \alpha \nabla_\theta L_{T_i}(\theta)$ (a single gradient step, $\alpha = 0.1$).

Outer loop: optimize $\mathbb{E}_{T_i}\!\left[L_{T_i}(\theta'_i)\right]$, implemented via TRPO with a KL constraint ($\delta = 0.01$).

Meta-training is performed on randomized variations of training tasks; meta-testing measures adaptation on held-out tasks.
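
The two-level optimization can be illustrated on a toy family of scalar tasks. The quadratic losses, step sizes, and plain-gradient outer update below are stand-ins for the actual TRPO-based policy optimization, chosen so the meta-gradient has a closed form.

```python
# Toy MAML sketch: tasks are quadratics L_i(theta) = (theta - c_i)^2.
# A good meta-initialization sits near the task centers so that one
# inner gradient step adapts well to any individual task.

def inner_adapt(theta, c, alpha=0.1):
    # One inner-loop gradient step: dL/dtheta = 2 * (theta - c)
    return theta - alpha * 2.0 * (theta - c)

def meta_step(theta, task_centers, alpha=0.1, beta=0.05):
    # Outer-loop gradient of L_i(theta'_i) w.r.t. theta (chain rule):
    # dL_i(theta')/dtheta = 2*(theta' - c_i) * (1 - 2*alpha)
    grad = 0.0
    for c in task_centers:
        theta_prime = inner_adapt(theta, c, alpha)
        grad += 2.0 * (theta_prime - c) * (1.0 - 2.0 * alpha)
    grad /= len(task_centers)
    return theta - beta * grad

theta = 5.0                       # poor initialization
centers = [-1.0, 0.0, 1.0]        # meta-training "tasks"
for _ in range(500):
    theta = meta_step(theta, centers)
# theta converges toward the task-center mean (0.0)
```

The same structure — differentiate through the inner adaptation step — is what makes MAML's outer update distinct from ordinary multi-task training.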

Performance:

  • Converged one-step adaptation success: 21.0% (train) vs 13.2% (test).
  • Per-task variation: success rates range from 0% (window-open, lever-pull) to 80% (door-open).
  • Generalization gap: test success plateaus as train performance climbs, attributed to limited training-task coverage, single-step bias, and overfitting to the meta-training set.

Ablation and protocol studies consistently find that increasing meta-training diversity (e.g., moving to ML50), task conditioning, and more structured skill decompositions alleviate the generalization gap (Atamuradov, 15 Nov 2025, Cho et al., 2024).

3. Structured RL Architectures and Skill Discovery

MetaWorld has catalyzed hierarchical and modular deep-RL methods that decompose policy learning into reusable, adaptable units:

  • Mixture of Orthogonal Experts (MOORE) (Hendawy et al., 2023): Constructs a shared basis of k maximally orthogonal feature “experts,” using Gram-Schmidt orthogonalization in the forward pass, with each task selecting a mixture via a learned task encoder. MOORE achieves 88.7% average success (MT10-rand) and 72.9% (MT50-rand), surpassing all prior multi-task RL baselines.
  • Parameterization and Skill Hierarchies (DEPS) (Gupta et al., 28 Oct 2025): DEPS discovers composable parameterized motor skills through meta-learning from expert demonstrations. A three-level structure parameterizes skills with discrete (skill type) and continuous (arguments) latents, bottlenecking low-level policies to prevent degeneracy. DEPS consistently outperforms multitask and skill-based baselines in few-shot transfer across MW-Vanilla and MW-PRISE splits.
  • Automated Macro-Action Discovery (HiMeta) (Cho et al., 2024): HiMeta builds a tri-level hierarchy with a high-level task encoder (GRU-based, categorical latent), an action-agnostic macro-action VAE in an “ego-state” (task-agnostic) space, and a low-level actor with PPO and intrinsic reward shaping. Macro-actions (learned latent z) are recombined and rapidly adapted to unseen tasks, yielding marked improvements over prior hierarchical meta-RL approaches.
  • Hierarchical World Models (Shen et al., 24 Jan 2026): Recent meta-control systems factorize decision making: a VLM-based semantic layer parses instructions, a latent world model fuses dynamic expert-policy selection, and a latent dynamics model enables planning. This paradigm achieves state-of-the-art returns and sample efficiency on complex humanoid manipulation tasks, exploiting a library of pre-trained primitive experts and dynamic adaptation.

4. Generalist Policies, Visual and Language Conditioning

Recent work demonstrates that transformer-based generalist policies, diffusion models, and VLMs can unlock task generalization and compositionality via multi-modal prompts:

  • DualMind (Wei et al., 2023): A dual-phase strategy first pre-trains on all ML45 tasks with unsupervised next-state/action/long-range masking losses, then adapts only the decoder and prompt-fusion layers for task-conditioned imitation via natural-language prompts. This approach yields a mean success rate of 0.78 and solves ≥30 of 45 tasks at ≥90% success with zero-shot prompting.
  • VARP (Singh et al., 18 Mar 2025): Preference-based RL with VLMs leverages trajectory sketches overlaid on final observations, increasing VLM comparison fidelity (68%→84%) and, combined with agent-regularized preference learning, achieves 70–80% success on single MetaWorld tasks, nearly closing the gap to oracle reward labeling.
  • Diffusion-Based Policies: Fully-diffusive controllers, such as DAWN (Nguyen et al., 26 Sep 2025) and ISS Policy (Xia et al., 17 Dec 2025), model motion planning and control as iterative denoising in either pixel-dense motion space or 3D point cloud space, conditioned on scene geometry and language. DAWN delivers 65.4% mean task success (11 tasks), outperforming previous video- or flow-based approaches; ISS Policy scales success to 86.2% via implicit scene supervision and multi-scale DiT transformers.
  • DemoGen (Jin et al., 2024): Given a language instruction, a VLM expands prompts, a diffusion model generates demonstration videos, and an inverse dynamics model extracts actions. The generated (synthetic) demos yield up to 3× higher success on zero-shot tasks for policies trained via imitation, compared to expert-only demonstrations.
  • Contextual Planning Networks (CPN) (Rivera et al., 2021): Visual goal-based meta-learning performs latent-space planning using only goal images, employing neuromodulation and explicit goal-conditioning for zero-shot generalization, achieving notable first-attempt success rates (>60% on door-lock, >70% on drawer-close) in strictly visual settings.

5. Continual and Embodiment-Agnostic Learning

MetaWorld facilitates research on lifelong/continual learning and policies transferable across morphologies:

  • Continual RL and Orthogonality: Parseval Regularization (Chung et al., 2024) enforces weight orthogonality to maintain network plasticity, boosting median task success from 0.45 to 0.72 in continual (Metaworld20-10) protocols, preserving stable rank, policy entropy, and rapid adaptation after task changes.
  • Koopman Q-Learning (Weissenbacher et al., 2021): Leverages latent Koopman operators to infer system symmetries, using these to augment offline RL datasets by “symmetry-shifted” transitions. On ML1 tasks, performance improves by 5–15 pp over CQL and S4RL baselines.
  • Embodiment-Agnostic Planning (Tang et al., 2024): Object-part scene flow models predict 3D motion of manipulated parts from human or robot-agnostic videos, extracting SE(3) transforms applied via analytical IK for any robot. This yields a 70.8% overall success on 11 MetaWorld tasks, a 27.7% improvement on prior art, and robust real-to-sim (and human-to-robot) transfer.
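
The orthogonality constraint behind Parseval regularization penalizes deviation of each weight matrix from row-orthonormality. A minimal version of the penalty ||W Wᵀ − I||²_F can be written in pure Python; the exact formulation and scaling in (Chung et al., 2024) may differ.

```python
def parseval_penalty(W):
    """Frobenius-norm penalty ||W W^T - I||_F^2 on the rows of W.

    Sketch of the orthogonality term behind Parseval-style
    regularizers; details may differ from (Chung et al., 2024).
    """
    rows = len(W)
    penalty = 0.0
    for i in range(rows):
        for j in range(rows):
            gram_ij = sum(a * b for a, b in zip(W[i], W[j]))  # (W W^T)[i][j]
            target = 1.0 if i == j else 0.0                   # identity entry
            penalty += (gram_ij - target) ** 2
    return penalty
```

Added to the RL loss with a small coefficient, this term keeps weight matrices near-orthogonal, which is what preserves stable rank and plasticity across task switches in the continual setting.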

6. Empirical Insights and Assessment Protocols

MetaWorld enables highly granular analysis of adaptation, generalization, and failure modes over many tasks and data regimes:

  • Task-wise metrics reveal systematic variance: certain skills (e.g., door-open, basketball, peg-insertion) respond robustly to meta-learning, while others (e.g., sweep-into, shelf-place) expose generalization failures.
  • Across nearly all benchmarks, bottlenecking, orthogonal skill learning, hierarchical decompositions, and explicit task/context conditioning improve data efficiency, asymptotic performance, and robustness to novel tasks and morphologies.
  • Preference-based, imitation, and diffusion approaches all benefit from tight integration of scene geometry and trajectory structure into model architectures and loss functions.

7. Open Challenges and Future Directions

Despite rapid progress, open issues remain:

  • Generalization gap: Overfitting to training task distribution and limited task coverage lead to stagnation on unseen tasks. Richer task sampling and meta-objective designs are needed (Atamuradov, 15 Nov 2025).
  • Embodiment transfer: Current methods are not fully robust to morphology/scene shifts, despite progress via object-centric scene flow (Tang et al., 2024).
  • Sample efficiency: Most state-of-the-art approaches still require large numbers of environment steps or demonstrations per task. Hierarchical reuse and hybrid meta-learning/inference frameworks are active research areas (Cho et al., 2024, Shen et al., 24 Jan 2026).
  • Reward Specification: Preference-based methods remain bottlenecked by label fidelity and annotation cost; VLM-based augmentations and agent regularization represent substantial gains but are not yet fully mature (Singh et al., 18 Mar 2025).
  • Real-world deployment: Robust sim-to-real transfer and evaluation under physical variability are limited; only a subset of methods have real-robot demonstrations or deployment results.

MetaWorld continues to drive advances at the intersection of structured RL, generalist control, zero/few-shot imitation, and robust transfer, fostering development and evaluation of broad robotic competencies across diverse manipulation skills and environments.
