
World-Gymnast: Model-Based RL for Robotics

Updated 5 February 2026
  • World-Gymnast is a model-based RL framework that trains robot policies in a learned, action-conditioned video world model to enhance real-world performance.
  • The approach leverages high-fidelity video modeling with VLM-based rewards and parallel rollouts, yielding substantial improvements over SFT and simulator baselines.
  • It enables rapid test-time adaptation and iterative data refinement while addressing challenges like distributional shift, reward noise, and safety.

World-Gymnast is a model-based reinforcement learning (RL) framework that enables training of robot policies in a learned action-conditioned video world model. This approach seeks to address the inherent limitations of both supervised finetuning (SFT) from expert demonstrations and conventional simulator-based RL in the context of real-world robotic manipulation. By executing RL rollouts entirely within a high-fidelity video world model and employing a vision-language model (VLM) as the reward function, World-Gymnast achieves substantial improvements in policy transfer to real robots, outperforming both SFT and simulator baselines on the Bridge robot benchmark (Sharma et al., 2 Feb 2026).

1. System Architecture

World-Gymnast consists of two major components: an action-conditioned video world model and a vision-language-action (VLA) policy. The world model is based on the WorldGym architecture [Quevedo et al., 2025], with the following structure:

  • Input/Output: Each time step receives a 256×256 RGB frame $x_t$ and a 7D real-valued robot action $a_t$ (6-DOF end-effector pose and binary gripper state), and predicts the next frame $x_{t+1}$.
  • Representation: Frames are encoded to latents $z_t \in \mathbb{R}^{4096}$ using a pretrained VAE from Stable Diffusion 3 [Esser et al., 2024], then predicted forward using a DiT-style diffusion transformer (16 layers, hidden size 1024, 16 heads, MLP-ratio 4×), conditioned on the last $k=20$ latent-action pairs.
  • Training Losses: Reconstructions are trained with a VAE pixel loss $L_\text{vae}$ and a diffusion denoising loss $L_\text{diff}$:

$$L_\text{vae} = \mathbb{E}_{x \sim \text{data}} \left[ \|x - \text{Dec}(\text{Enc}(x))\|^2 \right]$$

$$L_\text{diff} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(0,I),\, t} \left[ \|\epsilon - \epsilon_\theta(z_t, a_t, t)\|^2 \right]$$

Total loss: $L = L_\text{vae} + \lambda L_\text{diff}$, with $\lambda = 1$.

  • Efficiency: To accelerate RL rollouts, transformer attention keys/values are cached, reducing per-step compute by approximately $10\times$.
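As a concrete toy illustration of the two training losses above, the NumPy sketch below computes $L_\text{vae}$ and a single-sample estimate of $L_\text{diff}$. The encoder, decoder, noise predictor, and noising schedule here are stand-in assumptions for illustration only, not the paper's SD3 VAE or DiT denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, enc, dec):
    """Pixel reconstruction loss: mean squared error ||x - Dec(Enc(x))||^2."""
    return np.mean((x - dec(enc(x))) ** 2)

def diffusion_loss(z, a, eps_pred):
    """Single-sample denoising loss ||eps - eps_theta(z_noisy, a, t)||^2."""
    eps = rng.standard_normal(z.shape)                  # target noise eps ~ N(0, I)
    t = rng.uniform()                                   # diffusion time in (0, 1)
    z_noisy = np.sqrt(1.0 - t) * z + np.sqrt(t) * eps   # toy noising schedule (assumption)
    return np.mean((eps - eps_pred(z_noisy, a, t)) ** 2)

# Toy stand-ins (hypothetical; the paper uses the SD3 VAE and a DiT denoiser).
enc = lambda x: 0.5 * x
dec = lambda z: 2.0 * z
eps_pred = lambda z, a, t: np.zeros_like(z)

x = rng.standard_normal((256, 256, 3))                  # one RGB frame
z, a = rng.standard_normal(4096), rng.standard_normal(7)  # latent and 7D action
lam = 1.0                                               # lambda = 1, as in the total loss
total = vae_loss(x, enc, dec) + lam * diffusion_loss(z, a, eps_pred)
```

With the exact-inverse toy autoencoder the reconstruction term vanishes, so the total loss here is dominated by the denoising term; in training, both terms are driven down jointly.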

The VLA policy $\pi_\theta(o_t, g)$ is a large transformer (7B parameters) built upon OpenVLA-OFT [Kim et al., 2025], comprising:

  • Visual encoder: CNN mapping 256×256 RGB images to tokens,
  • Language encoder: Pretrained LLaMA-2 for the instruction $g$,
  • Cross-modal fusion: Attention between image and language representations,
  • Action head: Autoregressive real-valued action outputs.

Policy initialization proceeds via supervised finetuning (SFT): starting from OpenVLA, the policy is finetuned on expert BridgeData V2 trajectories for 20k steps.

2. Model-Based Reinforcement Learning in Imagination

World-Gymnast conducts fully model-based RL by rolling out $\pi_\theta$ in the world model and optimizing the expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{H-1} \gamma^t r(\tau, g) \right] \approx \mathbb{E}_{\tau_k} \left[ r(\tau_k, g) \right]$$

where $r$ is a binary (success/failure) reward, evaluated over the whole trajectory by a VLM.

Policy gradient optimization adopts Group Relative Policy Optimization (GRPO) [Shao et al., 2024]:

  • Generate $K=8$ parallel rollouts $\tau_k$ in the world model.
  • Score each rollout $r_k \in \{0,1\}$ by submitting sampled frames to GPT-4o, prompted with the current instruction $g$.
  • Compute the group reward mean $\mu$ and standard deviation $\sigma$; normalize the advantage per trajectory, $A_k = (r_k - \mu)/(\sigma + \epsilon)$, assigned to all timesteps $t$ in $\tau_k$.
  • Minimize a clipped surrogate loss (PPO-style):

$$L_\text{CLIP}(\theta) = -\mathbb{E}_{k,t}\left[ \min \left( \rho_{t,k}(\theta) A_k,\ \text{clip}\!\left(\rho_{t,k}(\theta),\, 1-\epsilon_\text{low},\, 1+\epsilon_\text{high}\right) A_k \right) \right]$$

with $\rho_{t,k}(\theta) = \pi_\theta(a_{t,k} \mid o_{t,k}, g)\,/\,\pi_{\theta_\text{old}}(a_{t,k} \mid o_{t,k}, g)$. Hyperparameters: learning rate $5 \times 10^{-6}$; $\epsilon_\text{low}=0.2$, $\epsilon_\text{high}=0.28$; horizon $H=40$; sampling temperature 1.6.
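The group-relative advantage and clipped surrogate above can be sketched in a few lines of NumPy. The binary rewards and unit importance ratios below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage A_k = (r_k - mu) / (sigma + eps),
    shared by every timestep of rollout k."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate objective (to be maximized; negate for L_CLIP)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return np.minimum(unclipped, clipped).mean()

# K = 8 imagined rollouts with binary VLM rewards (illustrative values).
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
adv = grpo_advantages(rewards)
```

Successful rollouts receive positive advantages and failed ones negative; if every rollout in a group gets the same reward, the advantages collapse toward zero, so the learning signal comes from groups that mix successes and failures.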

The VLM reward uses GPT-4o prompted with sampled frames (downsampled by stride 3); the final binary score is a majority vote over 5 VLM rollouts.
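A minimal sketch of this reward, assuming the majority vote is taken over repeated queries of the VLM on the same strided frames; `query_vlm` here is a hypothetical stand-in for an actual GPT-4o call.

```python
from collections import Counter

def vlm_reward(frames, instruction, query_vlm, n_votes=5, stride=3):
    """Binary success reward: query the VLM n_votes times on strided frames
    and return the majority vote (0 or 1)."""
    sampled = frames[::stride]  # downsample the trajectory by stride 3
    votes = [query_vlm(sampled, instruction) for _ in range(n_votes)]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stub in place of an actual GPT-4o call.
always_success = lambda frames, g: 1
r = vlm_reward(list(range(40)), "put the eggplant in the blue sink", always_success)
```

Voting averages out some of the VLM's per-query stochasticity; Section 5 notes that residual reward noise remains an open issue.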

3. Empirical Evaluation and Baseline Comparison

World-Gymnast is evaluated on the Bridge robot (a WidowX arm in drawer-cabinet and sink-counter scenes) under the AutoEval protocol [Zhou et al., 2025]. Baselines include SFT (expert demonstrations only), Iter-SFT (two rounds with synthetic rollouts), and SIMPLER (simulator RL with dense reward) [Li et al., 2024].

| Task | SFT success | SIMPLER success | World-Gymnast success |
| --- | --- | --- | --- |
| Open drawer | — | 34±7% | 58±4% |
| Close drawer | — | 74±5% | 62±6% |
| Put eggplant in blue sink | 4±4% | 32±10% | 72±10% |
| Put eggplant in yellow basket | 8±4% | 40±10% | 78±2% |

World-Gymnast outperforms SFT by as much as 18× (4% → 72% on the blue-sink task) and simulator-based RL by up to roughly 2×, with the caveat that close drawer is higher for SIMPLER (74%) than for World-Gymnast (62%).

4. Extensions: Generalization, Adaptation, and Iterative Data-Flywheel

World-Gymnast exhibits several advanced capabilities:

  • Diverse Policy Training: RL can be initialized from arbitrary images $o_0$ within the world model's support, enabling the learning of recovery behaviors; cluttered "distractor" scenes (WG-Distract) achieve 78±2% (vs. 74±3% originally); OOD instructions (WG-Language) yield 81±1% success; additional BridgeData tasks (WG-Scaled) reach 81±4%.
  • Test-Time Adaptation: Given a novel test-scene frame $o_0$ and instruction $g$, a brief RL episode in WorldGym yields rapid adaptation. Example: close-drawer success improves from 62±6% to 100% on a single test task. However, such adaptation currently overfits and does not generalize to new tasks without further research.
  • Iterative Improvement: Integrating real-robot rollouts, expanding the world model dataset, fine-tuning the world model, and then RL-finetuning $\pi_\theta$ further reduces sim-to-real artifacts and increases close-drawer real-world success to 95%.
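The test-time adaptation loop described above might be sketched as follows. The `rollout` and `grpo_update` interfaces are hypothetical, and the stub classes exist only to make the sketch self-contained and runnable; they are not the paper's implementation.

```python
class StubWorldModel:
    """Minimal stand-in for the learned video world model (assumption)."""
    def rollout(self, policy, o0, g, horizon):
        return [policy.act(o0, g) for _ in range(horizon)]

class StubPolicy:
    """Minimal stand-in for the VLA policy; counts GRPO updates (assumption)."""
    def __init__(self):
        self.updates = 0
    def act(self, o, g):
        return 0.0
    def grpo_update(self, rollouts, rewards):
        self.updates += 1

def adapt_at_test_time(policy, world_model, reward_fn, o0, g,
                       n_iters=10, K=8, H=40):
    """Repeat GRPO-style updates on imagined rollouts started from the
    novel test-scene frame o0 under instruction g."""
    for _ in range(n_iters):
        rollouts = [world_model.rollout(policy, o0, g, horizon=H)
                    for _ in range(K)]
        rewards = [reward_fn(tau, g) for tau in rollouts]
        policy.grpo_update(rollouts, rewards)
    return policy

policy = adapt_at_test_time(StubPolicy(), StubWorldModel(),
                            reward_fn=lambda tau, g: 1,
                            o0=None, g="close the drawer")
```

The key point the sketch captures is that no real-robot interaction occurs inside the loop: all rollouts are imagined from a single test-scene frame, which is what makes adaptation fast but also why it can overfit to that scene.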

5. Limitations and Open Challenges

Noted limitations include:

  • Distributional Shift: WorldGym rollouts degrade if $o_0$ lies outside the world model's training distribution, necessitating both broad pretraining and frequent real-robot data augmentation.
  • Reward Model Noise: GPT-4o can misclassify task success, introducing reward stochasticity. Mitigation requires research into learned reward models [Lee et al., 2026] and reward shaping.
  • Sparse Rewards: Binary success signals are susceptible to reward hacking and inefficiency; future work could leverage subgoal or hierarchical reward structures.
  • Safety: Hallucinated physics in the world model means learned policies may exploit model inaccuracies; real-world safety verification remains necessary, especially in critical applications.

6. Significance and Future Directions

World-Gymnast demonstrates that model-based RL in learned, action-conditioned video world models—with VLM-based rewards and scalable vision-language-action policies—can substantially improve the transfer of policy performance to real robots. The approach allows training from broad real-world video-action logs (Open X-Embodiment), supports robust generalization, enables rapid test-time adaptation, and establishes a data flywheel for continual improvement.

A plausible implication is that cloud-based model learning and large-scale world model RL will become a standard paradigm for robotics, complementing or replacing traditional SFT and simulator-based RL approaches as model and reward fidelity increase. Remaining challenges include expanding world model generalization, reducing reward noise, automating safety verification, and enabling true multi-task test-time adaptation (Sharma et al., 2 Feb 2026).
