World-Gymnast: Model-Based RL for Robotics
- World-Gymnast is a model-based RL framework that trains robot policies in a learned, action-conditioned video world model to enhance real-world performance.
- The approach leverages high-fidelity video modeling with VLM-based rewards and parallel rollouts, yielding substantial improvements over SFT and simulator baselines.
- It enables rapid test-time adaptation and iterative data refinement while addressing challenges like distributional shift, reward noise, and safety.
World-Gymnast is a model-based reinforcement learning (RL) framework that enables training of robot policies in a learned, action-conditioned video world model. This approach seeks to address the inherent limitations of both supervised finetuning (SFT) from expert demonstrations and conventional simulator-based RL in the context of real-world robotic manipulation. By executing RL rollouts entirely within a high-fidelity video world model and employing a vision-language model (VLM) as the reward function, World-Gymnast achieves substantial improvements in policy transfer to real robots, outperforming both SFT and simulator baselines on the Bridge robot benchmark (Sharma et al., 2 Feb 2026).
1. System Architecture
World-Gymnast consists of two major components: an action-conditioned video world model and a vision-language-action (VLA) policy. The world model is based on the WorldGym architecture [Quevedo et al., 2025], with the following structure:
- Input/Output: Each time step takes a 256×256 RGB frame and a 7D real-valued robot action (6-DoF end-effector pose plus binary gripper state) and predicts the next frame.
- Representation: Frames are encoded to latents using a pretrained VAE from Stable Diffusion 3 [Esser et al., 2024]; a DiT-style diffusion transformer (16 layers, hidden size 1024, 16 attention heads, 4× MLP ratio) then predicts the next latent, conditioned on a window of recent latent-action pairs.
- Training Losses: The model is trained with a VAE pixel-reconstruction loss $\mathcal{L}_{\text{pixel}}$ and a diffusion denoising loss $\mathcal{L}_{\text{diff}}$; the total objective combines the two, $\mathcal{L} = \mathcal{L}_{\text{diff}} + \mathcal{L}_{\text{pixel}}$.
- Efficiency: To accelerate RL rollouts, transformer attention keys and values are cached across steps, substantially reducing per-step compute.
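To make the caching idea concrete, here is a minimal sketch (NumPy, single-head attention, toy dimensions — not World-Gymnast's actual DiT) of an autoregressive rollout with a key/value cache: each step projects only the new token and attends over cached keys/values instead of recomputing the whole sequence.

```python
import numpy as np

D = 8  # token dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for one query against cached K/V."""
    scores = q @ K.T / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

k_cache, v_cache = [], []

def step(x):
    """One rollout step: project only the new token, append its K/V to
    the cache, and attend over all past steps without recomputing them."""
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    return attend(x @ Wq, np.stack(k_cache), np.stack(v_cache))

out = [step(rng.standard_normal(D)) for _ in range(5)]
print(len(k_cache), out[-1].shape)  # cache grows by one entry per step
```

Without the cache, every step would re-project all previous tokens, so per-step cost grows with rollout length; with it, only the attention itself scales with history.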
The VLA policy is a large transformer (7B parameters) built upon OpenVLA-OFT [Kim et al., 2025], comprising:
- Visual encoder: maps 256×256 RGB images to a sequence of visual tokens,
- Language encoder: a pretrained Llama-2 backbone encodes the task instruction,
- Cross-modal fusion: attention between image and language representations,
- Action head: autoregressive prediction of real-valued actions.
The policy is initialized via supervised finetuning (SFT): starting from OpenVLA, it is finetuned on expert BridgeData V2 trajectories for 20k steps.
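As a toy illustration of this SFT stage, the sketch below fits a stand-in linear policy to expert actions by minimizing a mean-squared-error behavior-cloning loss; the linear model, random data, and learning rate are all illustrative assumptions, not the 7B VLA.

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.standard_normal((64, 16))            # stand-in for fused image/language features
expert_actions = rng.standard_normal((64, 7))  # 6-DoF pose + gripper per step

W = np.zeros((16, 7))  # toy linear policy standing in for the VLA
lr = 0.05
for _ in range(200):   # gradient descent on the MSE behavior-cloning loss
    pred = obs @ W
    grad = obs.T @ (pred - expert_actions) / len(obs)
    W -= lr * grad

mse = np.mean((obs @ W - expert_actions) ** 2)
print(f"final SFT loss: {mse:.3f}")
```

The point is only the objective shape: SFT regresses predicted actions onto the expert's, with no notion of task success — which is exactly the gap the RL stage below is meant to close.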
2. Model-Based Reinforcement Learning in Imagination
World-Gymnast performs fully model-based RL by rolling out the policy inside the world model and optimizing the expected reward

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right],$

where $R(\tau) \in \{0, 1\}$ is a binary success/failure signal evaluated over the whole trajectory by a VLM.
Policy gradient optimization adopts Group Relative Policy Optimization (GRPO) [Shao et al., 2024]:
- Generate a group of parallel rollouts in the world model.
- Score each rollout by submitting sampled frames to GPT-4o, prompted with the current task instruction.
- Compute the group's reward mean and standard deviation and normalize the advantage per trajectory, $\hat{A}_i = (R_i - \operatorname{mean}(R)) / \operatorname{std}(R)$, assigned to all timesteps of trajectory $i$.
- Minimize a PPO-style clipped surrogate loss, $\mathcal{L}(\theta) = -\mathbb{E}\left[\min\left(r_t \hat{A},\ \operatorname{clip}(r_t,\ 1-\epsilon,\ 1+\epsilon)\,\hat{A}\right)\right]$, with importance ratio $r_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$; sampling uses temperature 1.6.
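The group normalization and clipped surrogate above can be sketched in a few lines; the group size, toy log-probabilities, and `eps=0.2` are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def grpo_advantages(rewards):
    """Normalize rewards within a rollout group: A_i = (R_i - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """PPO-style loss: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)]."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))

# A group of 4 imagined rollouts scored 0/1 by the VLM reward.
adv = grpo_advantages([1, 0, 0, 1])
loss = clipped_surrogate(
    logp_new=np.array([-1.0, -2.0, -1.5, -0.9]),
    logp_old=np.array([-1.1, -1.9, -1.6, -1.0]),
    adv=adv,
)
print(adv, round(loss, 4))
```

Because rewards are normalized within each group, successful rollouts get positive advantages and failed ones negative, with no separate value-function baseline — the key simplification GRPO makes over PPO.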
The VLM reward uses GPT-4o prompted with sampled frames (downsampled by stride 3); the final binary score is a majority vote over 5 rollouts.
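A minimal sketch of this reward pipeline follows; `query_vlm` is a stub standing in for the GPT-4o call, and voting over repeated queries per rollout is an assumption about the voting granularity.

```python
def subsample(frames, stride=3):
    """Keep every `stride`-th frame to shorten the VLM prompt."""
    return frames[::stride]

def majority_vote(votes):
    """Binary success = 1 iff more than half the votes are 1."""
    return int(sum(votes) > len(votes) / 2)

def score_rollout(frames, query_vlm, n_votes=5):
    """Subsample the rollout, query the VLM several times, take the vote."""
    clip = subsample(frames)
    return majority_vote([query_vlm(clip) for _ in range(n_votes)])

# Toy stub: "success" if the last sampled frame is late in the episode.
frames = [{"t": t} for t in range(30)]
reward = score_rollout(frames, query_vlm=lambda clip: int(clip[-1]["t"] > 25))
print(reward)
```

Majority voting is a simple variance-reduction step: a single noisy VLM judgment flips the binary reward, but a flipped vote among five rarely changes the majority.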
3. Empirical Evaluation and Baseline Comparison
World-Gymnast is evaluated on the Bridge robot (a WidowX arm in drawer-cabinet and sink-counter scenes) using the AutoEval protocol [Zhou et al., 2025]. Baselines include SFT (expert demonstrations only), Iter-SFT (two rounds of SFT with synthetic rollouts), and SIMPLER (simulator-based RL with dense rewards) [Li et al., 2024].
| Task | SFT Success | SIMPLER Success | World-Gymnast Success |
|---|---|---|---|
| Open drawer | — | 34±7% | 58±4% |
| Close drawer | — | 74±5% | 62±6% |
| Put eggplant in blue sink | 4±4% | 32±10% | 72±10% |
| Put eggplant in yellow basket | 8±4% | 40±10% | 78±2% |
World-Gymnast outperforms SFT by as much as 18× (put eggplant in blue sink) and simulator-based RL by up to 2×; the one exception is close drawer, where SIMPLER (74%) exceeds World-Gymnast (62%).
4. Extensions: Generalization, Adaptation, and Iterative Data-Flywheel
World-Gymnast exhibits several advanced capabilities:
- Diverse Policy Training: RL can be initialized from arbitrary images within the world model's support, enabling learning of recovery behaviors; cluttered "distractor" scenes (WG-Distract) achieve 78±2% (vs. 74±3% in the original scenes); out-of-distribution instructions (WG-Language) yield 81±1% success; additional BridgeData tasks (WG-Scaled) reach 81±4%.
- Test-Time Adaptation: Given a novel test-scene frame and instruction, a brief RL episode in WorldGym yields rapid adaptation; for example, close-drawer success improves from 62±6% to 100% on a single test task. However, this adaptation currently overfits to the adapted task and does not generalize to new tasks.
- Iterative Improvement: Integrating real-robot rollouts into the world-model training set, finetuning the world model, and then RL-finetuning the policy further reduces sim-to-real artifacts, raising close-drawer real-world success to 95%.
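The iterative data flywheel described above can be sketched structurally as below; every function is a stub standing in for the corresponding training component, so this shows only the control flow, not the actual systems.

```python
def collect_real_rollouts(policy, n=8):
    """Stub: deploy the current policy on the real robot, log trajectories."""
    return [f"traj-{policy}-{i}" for i in range(n)]

def finetune_world_model(world_model, dataset):
    """Stub: refit the world model on the expanded dataset."""
    return world_model + 1  # counter standing in for an improved model

def rl_in_imagination(policy, world_model):
    """Stub: RL-finetune the policy inside the refreshed world model."""
    return policy + 1       # counter standing in for an improved policy

def flywheel(policy, world_model, dataset, rounds=3):
    for _ in range(rounds):
        dataset += collect_real_rollouts(policy)                  # 1. real rollouts
        world_model = finetune_world_model(world_model, dataset)  # 2. refit model
        policy = rl_in_imagination(policy, world_model)           # 3. RL in imagination
    return policy, world_model, dataset

policy, wm, data = flywheel(policy=0, world_model=0, dataset=[])
print(policy, wm, len(data))
```

The ordering matters: the world model is refreshed with the newest real data before each RL stage, so the policy is always optimized against the least stale dynamics available.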
5. Limitations and Open Challenges
Noted limitations include:
- Distributional Shift: WorldGym rollouts degrade when the policy visits states outside the world model's training distribution, necessitating both broad pretraining and regular real-robot data augmentation.
- Reward Model Noise: GPT-4o can misclassify task success, introducing reward stochasticity. Mitigation requires research into learned reward models [Lee et al., 2026] and reward shaping.
- Sparse Rewards: Binary success signals are susceptible to reward hacking and inefficiency; future work could leverage subgoal or hierarchical reward structures.
- Safety: Hallucinated physics in the world model means learned policies may exploit model inaccuracies; real-world safety verification remains necessary, especially in critical applications.
6. Significance and Future Directions
World-Gymnast demonstrates that model-based RL in learned, action-conditioned video world models—with VLM-based rewards and scalable vision-language-action policies—can substantially improve the transfer of policy performance to real robots. The approach allows training from broad real-world video-action logs (Open X-Embodiment), supports robust generalization, enables rapid test-time adaptation, and establishes a data flywheel for continual improvement.
A plausible implication is that cloud-based model learning and large-scale world model RL will become a standard paradigm for robotics, complementing or replacing traditional SFT and simulator-based RL approaches as model and reward fidelity increase. Remaining challenges include expanding world model generalization, reducing reward noise, automating safety verification, and enabling true multi-task test-time adaptation (Sharma et al., 2 Feb 2026).