
World-Gymnast: Model-Based RL for Robotics

Updated 5 February 2026
  • World-Gymnast is a model-based RL framework that trains robot policies in a learned, action-conditioned video world model to enhance real-world performance.
  • The approach leverages high-fidelity video modeling with VLM-based rewards and parallel rollouts, yielding substantial improvements over SFT and simulator baselines.
  • It enables rapid test-time adaptation and iterative data refinement while addressing challenges like distributional shift, reward noise, and safety.

World-Gymnast is a model-based reinforcement learning (RL) framework that enables training of robot policies in a learned action-conditioned video world model. This approach seeks to address the inherent limitations of both supervised finetuning (SFT) from expert demonstrations and conventional simulator-based RL in the context of real-world robotic manipulation. By executing RL rollouts entirely within a high-fidelity video world model and employing a vision-language model (VLM) as the reward function, World-Gymnast achieves substantial improvements in policy transfer to real robots, outperforming both SFT and simulator baselines on the Bridge robot benchmark (Sharma et al., 2 Feb 2026).

1. System Architecture

World-Gymnast consists of two major components: an action-conditioned video world model and a vision-language-action (VLA) policy. The world model is based on the WorldGym architecture [Quevedo et al., 2025], with the following structure:

  • Input/Output: Each time step receives a 256×256 RGB frame $x_t$ and a 7D real-valued robot action $a_t$ (6-DOF end-effector pose and binary gripper state), and predicts the next frame $x_{t+1}$.
  • Representation: Frames are encoded to latents $z_t \in \mathbb{R}^{4096}$ using a pretrained VAE from Stable Diffusion 3 [Esser et al., 2024], then predicted forward using a DiT-style diffusion transformer (16 layers, hidden size 1024, 16 heads, MLP-ratio 4×), conditioned on the last $k=20$ latent-action pairs.
  • Training Losses: Reconstructions are trained with a VAE pixel loss $L_\text{vae}$ and a diffusion denoising loss $L_\text{diff}$:

$$L_\text{vae} = \mathbb{E}_{x \sim \text{data}} \left[ \|x - \text{Dec}(\text{Enc}(x))\|^2 \right]$$

$$L_\text{diff} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(0,I),\, t} \left[ \|\epsilon - \epsilon_\theta(z_t, a_t, t)\|^2 \right]$$

Total loss: $L = L_\text{vae} + \lambda L_\text{diff}$, with $\lambda = 1$.

  • Efficiency: To accelerate RL rollouts, transformer attention keys/values are cached, reducing per-step compute by approximately $10\times$.
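As a concrete toy illustration of the two training losses above, the NumPy sketch below computes $L_\text{vae}$ and a single-sample estimate of $L_\text{diff}$. The encoder, decoder, noise predictor, and noising schedule here are stand-in assumptions for illustration only, not the paper's SD3 VAE or DiT denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, enc, dec):
    """Pixel reconstruction loss: mean squared error ||x - Dec(Enc(x))||^2."""
    return np.mean((x - dec(enc(x))) ** 2)

def diffusion_loss(z, a, eps_pred):
    """Single-sample denoising loss ||eps - eps_theta(z_noisy, a, t)||^2."""
    eps = rng.standard_normal(z.shape)                  # target noise eps ~ N(0, I)
    t = rng.uniform()                                   # diffusion time in (0, 1)
    z_noisy = np.sqrt(1.0 - t) * z + np.sqrt(t) * eps   # toy noising schedule (assumption)
    return np.mean((eps - eps_pred(z_noisy, a, t)) ** 2)

# Toy stand-ins (hypothetical; the paper uses the SD3 VAE and a DiT denoiser).
enc = lambda x: 0.5 * x
dec = lambda z: 2.0 * z
eps_pred = lambda z, a, t: np.zeros_like(z)

x = rng.standard_normal((256, 256, 3))                  # one RGB frame
z, a = rng.standard_normal(4096), rng.standard_normal(7)  # latent and 7D action
lam = 1.0                                               # lambda = 1, as in the total loss
total = vae_loss(x, enc, dec) + lam * diffusion_loss(z, a, eps_pred)
```

With the exact-inverse toy autoencoder the reconstruction term vanishes, so the total loss here is dominated by the denoising term; in training, both terms are driven down jointly.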

The VLA policy $\pi_\theta(o_t, g)$ is a large transformer (7B parameters) built upon OpenVLA-OFT [Kim et al., 2025], comprising:

  • Visual encoder: CNN mapping 256×256 RGB images to tokens,
  • Language encoder: Pretrained LLaMA-2 for the instruction $g$,
  • Cross-modal fusion: Attention between image and language representations,
  • Action head: Autoregressive real-valued action outputs.

Policy initialization proceeds via supervised finetuning (SFT): starting from OpenVLA, the policy is finetuned on expert BridgeData V2 trajectories for 20k steps.

2. Model-Based Reinforcement Learning in Imagination

World-Gymnast conducts fully model-based RL by rolling out $\pi_\theta$ in the world model and optimizing the expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{H-1} \gamma^t r(\tau, g) \right] \approx \mathbb{E}_{\tau_k} \left[ r(\tau_k, g) \right]$$

where $r$ is a binary (success/failure) reward, evaluated over the whole trajectory by a VLM.

Policy gradient optimization adopts Group Relative Policy Optimization (GRPO) [Shao et al., 2024]:

  • Generate $K=8$ parallel rollouts $\tau_k$ in the world model.
  • Score each rollout $r_k \in \{0,1\}$ by submitting sampled frames to GPT-4o, prompted with the current instruction $g$.
  • Compute the group reward mean $\mu$ and standard deviation $\sigma$; normalize the advantage per trajectory, $A_k = (r_k - \mu)/(\sigma + \epsilon)$, assigned to all timesteps $t$ in $\tau_k$.
  • Minimize a clipped surrogate loss (PPO-style):

$$L_\text{CLIP}(\theta) = -\mathbb{E}_{k,t}\left[ \min \left( \rho_{t,k}(\theta) A_k,\ \text{clip}\!\left(\rho_{t,k}(\theta),\, 1-\epsilon_\text{low},\, 1+\epsilon_\text{high}\right) A_k \right) \right]$$

with $\rho_{t,k}(\theta) = \pi_\theta(a_{t,k} \mid o_{t,k}, g)\,/\,\pi_{\theta_\text{old}}(a_{t,k} \mid o_{t,k}, g)$. Hyperparameters: learning rate $5 \times 10^{-6}$; $\epsilon_\text{low}=0.2$, $\epsilon_\text{high}=0.28$; horizon $H=40$; sampling temperature 1.6.
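The group-relative advantage and clipped surrogate above can be sketched in a few lines of NumPy. The binary rewards and unit importance ratios below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage A_k = (r_k - mu) / (sigma + eps),
    shared by every timestep of rollout k."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate objective (to be maximized; negate for L_CLIP)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return np.minimum(unclipped, clipped).mean()

# K = 8 imagined rollouts with binary VLM rewards (illustrative values).
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
adv = grpo_advantages(rewards)
```

Successful rollouts receive positive advantages and failed ones negative; if every rollout in a group gets the same reward, the advantages collapse toward zero, so the learning signal comes from groups that mix successes and failures.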

The VLM reward uses GPT-4o prompted with sampled frames (downsampled by stride 3); the final binary score is a majority vote over 5 VLM rollouts.
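A minimal sketch of this reward, assuming the majority vote is taken over repeated queries of the VLM on the same strided frames; `query_vlm` here is a hypothetical stand-in for an actual GPT-4o call.

```python
from collections import Counter

def vlm_reward(frames, instruction, query_vlm, n_votes=5, stride=3):
    """Binary success reward: query the VLM n_votes times on strided frames
    and return the majority vote (0 or 1)."""
    sampled = frames[::stride]  # downsample the trajectory by stride 3
    votes = [query_vlm(sampled, instruction) for _ in range(n_votes)]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stub in place of an actual GPT-4o call.
always_success = lambda frames, g: 1
r = vlm_reward(list(range(40)), "put the eggplant in the blue sink", always_success)
```

Voting averages out some of the VLM's per-query stochasticity; Section 5 notes that residual reward noise remains an open issue.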

3. Empirical Evaluation and Baseline Comparison

World-Gymnast is evaluated on the Bridge robot (a WidowX arm in drawer-cabinet and sink-counter scenes) under the AutoEval protocol [Zhou et al., 2025]. Baselines include SFT (expert demonstrations only), Iter-SFT (two rounds with synthetic rollouts), and SIMPLER (simulator RL with dense reward) [Li et al., 2024].

| Task | SFT success | SIMPLER success | World-Gymnast success |
| --- | --- | --- | --- |
| Open drawer | — | 34±7% | 58±4% |
| Close drawer | — | 74±5% | 62±6% |
| Put eggplant in blue sink | 4±4% | 32±10% | 72±10% |
| Put eggplant in yellow basket | 8±4% | 40±10% | 78±2% |

World-Gymnast outperforms SFT by as much as 18× (4% → 72% on the blue-sink task) and simulator-based RL by up to roughly 2×, with the caveat that close drawer is higher for SIMPLER (74%) than for World-Gymnast (62%).

4. Extensions: Generalization, Adaptation, and Iterative Data-Flywheel

World-Gymnast exhibits several advanced capabilities:

  • Diverse Policy Training: RL can be initialized from arbitrary images $o_0$ within the world model's support, enabling the learning of recovery behaviors; cluttered "distractor" scenes (WG-Distract) achieve 78±2% (vs. 74±3% originally); OOD instructions (WG-Language) yield 81±1% success; additional BridgeData tasks (WG-Scaled) reach 81±4%.
  • Test-Time Adaptation: Given a novel test-scene frame $o_0$ and instruction $g$, a brief RL episode in WorldGym yields rapid adaptation. Example: close-drawer success improves from 62±6% to 100% on a single test task. However, such adaptation currently overfits and does not generalize to new tasks without further research.
  • Iterative Improvement: Integrating real-robot rollouts, expanding the world model dataset, fine-tuning the world model, and then RL-finetuning $\pi_\theta$ further reduces sim-to-real artifacts and increases close-drawer real-world success to 95%.
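The test-time adaptation loop described above might be sketched as follows. The `rollout` and `grpo_update` interfaces are hypothetical, and the stub classes exist only to make the sketch self-contained and runnable; they are not the paper's implementation.

```python
class StubWorldModel:
    """Minimal stand-in for the learned video world model (assumption)."""
    def rollout(self, policy, o0, g, horizon):
        return [policy.act(o0, g) for _ in range(horizon)]

class StubPolicy:
    """Minimal stand-in for the VLA policy; counts GRPO updates (assumption)."""
    def __init__(self):
        self.updates = 0
    def act(self, o, g):
        return 0.0
    def grpo_update(self, rollouts, rewards):
        self.updates += 1

def adapt_at_test_time(policy, world_model, reward_fn, o0, g,
                       n_iters=10, K=8, H=40):
    """Repeat GRPO-style updates on imagined rollouts started from the
    novel test-scene frame o0 under instruction g."""
    for _ in range(n_iters):
        rollouts = [world_model.rollout(policy, o0, g, horizon=H)
                    for _ in range(K)]
        rewards = [reward_fn(tau, g) for tau in rollouts]
        policy.grpo_update(rollouts, rewards)
    return policy

policy = adapt_at_test_time(StubPolicy(), StubWorldModel(),
                            reward_fn=lambda tau, g: 1,
                            o0=None, g="close the drawer")
```

The key point the sketch captures is that no real-robot interaction occurs inside the loop: all rollouts are imagined from a single test-scene frame, which is what makes adaptation fast but also why it can overfit to that scene.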

5. Limitations and Open Challenges

Noted limitations include:

  • Distributional Shift: WorldGym rollouts degrade if $o_0$ lies outside the world model's training distribution, necessitating both broad pretraining and frequent real-robot data augmentation.
  • Reward Model Noise: GPT-4o can misclassify task success, introducing reward stochasticity. Mitigation requires research into learned reward models [Lee et al., 2026] and reward shaping.
  • Sparse Rewards: Binary success signals are susceptible to reward hacking and inefficiency; future work could leverage subgoal or hierarchical reward structures.
  • Safety: Hallucinated physics in the world model means learned policies may exploit model inaccuracies; real-world safety verification remains necessary, especially in critical applications.

6. Significance and Future Directions

World-Gymnast demonstrates that model-based RL in learned, action-conditioned video world models—with VLM-based rewards and scalable vision-language-action policies—can substantially improve the transfer of policy performance to real robots. The approach allows training from broad real-world video-action logs (Open X-Embodiment), supports robust generalization, enables rapid test-time adaptation, and establishes a data flywheel for continual improvement.

A plausible implication is that cloud-based model learning and large-scale world model RL will become a standard paradigm for robotics, complementing or replacing traditional SFT and simulator-based RL approaches as model and reward fidelity increase. Remaining challenges include expanding world model generalization, reducing reward noise, automating safety verification, and enabling true multi-task test-time adaptation (Sharma et al., 2 Feb 2026).
