WorldRFT: Autonomous Driving World Model
- WorldRFT is a latent world model for autonomous driving that integrates vision-geometry fusion, hierarchical planning, and reinforcement fine-tuning.
- It applies a local-aware, iterative refinement process to enhance trajectory precision and achieve significant collision rate reductions.
- Reinforcement fine-tuning with Group Relative Policy Optimization drives state-of-the-art performance on benchmarks like nuScenes and NavSim.
WorldRFT (World Reinforcement Fine-Tuning) is a planning-oriented latent world model framework for end-to-end autonomous driving that explicitly aligns representation learning with planning objectives. It is designed to overcome the limitations of reconstruction-centric latent world models, which often entangle perception and planning and thereby inhibit optimal policy development for safety-critical driving. WorldRFT achieves this by employing a vision-geometry foundation model for 3D spatial awareness, hierarchical planning decomposition, local-aware iterative refinement, and reinforcement learning fine-tuning with Group Relative Policy Optimization (GRPO). The system demonstrates state-of-the-art performance on both open-loop nuScenes and closed-loop NavSim benchmarks, achieving substantial improvements in trajectory accuracy and collision rate reduction (Yang et al., 22 Dec 2025).
1. Latent World Model Architecture
WorldRFT’s core is a latent world model that requires no perception annotations; it is constructed through temporal self-supervised learning and optimized for planning relevance.
Visual–Geometric Fusion
The input comprises surround-view camera images, which are encoded into 2D visual features. A frozen Vision-Geometry Foundation Transformer (VGGT) yields multi-view consistent 3D tokens. Cross-attention between the 2D visual features and the 3D tokens produces the spatial-aware latent world state.
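The fusion step can be sketched as a single cross-attention pass in which 2D visual features query the 3D geometry tokens. This is a minimal numpy illustration of the mechanism only: the learned query/key/value projections, multi-head structure, and tensor shapes of the actual model are omitted, and all dimensions below are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention: queries attend over keys_values.
    # queries: (Nq, d), keys_values: (Nk, d) -> output: (Nq, d)
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values

rng = np.random.default_rng(0)
d = 8
visual_2d = rng.standard_normal((16, d))   # stand-in for per-patch 2D visual features
tokens_3d = rng.standard_normal((32, d))   # stand-in for VGGT multi-view 3D tokens

# Spatial-aware latent world state: 2D features enriched with 3D geometry context
latent_state = cross_attention(visual_2d, tokens_3d, d)
```

The output retains one latent vector per visual query, now mixed with geometry information weighted by feature similarity.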
World Transition
A compact world decoder models temporal dynamics by predicting the next latent state, conditioned on the current latent state and the planned trajectory, and is trained with a mean squared error reconstruction loss. This self-supervised transition mechanism ensures temporal consistency within the latent manifold.
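As a minimal sketch, the transition can be viewed as a function of the concatenated latent state and trajectory, scored against the observed next latent with MSE. The linear decoder and the dimensions below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_state, d_traj = 8, 4

# Hypothetical linear world decoder: z_next ≈ W @ [z_t ; traj_t]
W = rng.standard_normal((d_state, d_state + d_traj)) * 0.1

def predict_next_state(z_t, traj_t):
    return W @ np.concatenate([z_t, traj_t])

def transition_loss(z_t, traj_t, z_next):
    # Mean squared error between predicted and observed next latent state
    err = predict_next_state(z_t, traj_t) - z_next
    return float(np.mean(err ** 2))

z_t = rng.standard_normal(d_state)
traj_t = rng.standard_normal(d_traj)
z_next = rng.standard_normal(d_state)
loss = transition_loss(z_t, traj_t, z_next)
```

Minimizing this loss over temporal pairs is what enforces consistency of the latent dynamics without any perception labels.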
Self-Supervised and Imitation Losses
Semantic alignment is facilitated through cross-entropy on pseudo-masks from a vision–LLM (Grounded-SAM). Hierarchical planning supervision employs Laplace negative log-likelihood for target region prediction and L1 losses for trajectory and path supervision. The overall pre-training objective is a weighted sum of semantic, reconstruction, target, and trajectory losses.
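The loss composition described above can be sketched as follows. The Laplace NLL and L1 terms follow their standard definitions; the function names and unit weights are assumptions for illustration, not the paper's actual coefficients.

```python
import numpy as np

def laplace_nll(mu, b, x):
    # Standard Laplace negative log-likelihood: log(2b) + |x - mu| / b
    return float(np.mean(np.log(2.0 * b) + np.abs(x - mu) / b))

def l1_loss(pred, target):
    return float(np.mean(np.abs(pred - target)))

def pretraining_loss(l_sem, l_rec, l_tgt, l_trj,
                     w_sem=1.0, w_rec=1.0, w_tgt=1.0, w_trj=1.0):
    # Weighted sum of semantic, reconstruction, target, and trajectory terms
    return w_sem * l_sem + w_rec * l_rec + w_tgt * l_tgt + w_trj * l_trj

mu, b = np.zeros(2), np.ones(2)
total = pretraining_loss(
    l_sem=0.7,                                    # e.g. cross-entropy on pseudo-masks
    l_rec=0.4,                                    # latent transition MSE
    l_tgt=laplace_nll(mu, b, np.array([0.5, -0.5])),  # target region NLL
    l_trj=l1_loss(np.zeros(4), np.ones(4)),       # trajectory / path L1
)
```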
2. Hierarchical Planning Decomposition
WorldRFT decomposes planning into three hierarchical sub-tasks:
Task Definition
- High-level: Target region localization captures terminal goal uncertainty via a Laplace (location, scale) parameterization.
- Mid-level: Spatial path planning produces a sequence of spatial waypoints.
- Low-level: Temporal trajectory planning yields timestamped trajectory points.
Query-Based Feature Extraction
Hierarchical queries for the target, path, and trajectory sub-tasks are fused with the latent world state through cross-attention and self-attention, resulting in task-specific feature vectors.
Prediction Heads
Multi-layer perceptrons generate target region estimates, paths, and trajectories from the attended features, supporting structured planning grounded in latent state representation.
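A compact sketch of the three prediction heads: each is a small MLP applied to its attended task feature. The two-layer form, output counts (one target region, five waypoints, six timestamped points), and dimensions are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

def mlp_head(x, W1, W2):
    # Two-layer perceptron with ReLU: W2 @ relu(W1 @ x)
    return W2 @ np.maximum(W1 @ x, 0.0)

feat = rng.standard_normal(d)  # attended task-specific feature vector

# Hypothetical head shapes:
W1t, W2t = rng.standard_normal((16, d)), rng.standard_normal((4, 16))   # target: (mu_x, mu_y, b_x, b_y)
W1p, W2p = rng.standard_normal((16, d)), rng.standard_normal((10, 16))  # 5 spatial waypoints (x, y)
W1j, W2j = rng.standard_normal((16, d)), rng.standard_normal((12, 16))  # 6 timestamped points (x, y)

target = mlp_head(feat, W1t, W2t)
path = mlp_head(feat, W1p, W2p).reshape(5, 2)
traj = mlp_head(feat, W1j, W2j).reshape(6, 2)
```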
3. Local-Aware Iterative Refinement
WorldRFT utilizes a local-aware, multi-step refinement process to improve the precision of its planning outputs.
- Initialization: Initial predictions of the target region, path, and trajectory are generated.
- Iterative Update: For each refinement iteration:
- Encode the global plan state.
- Project trajectory points into camera space.
- Sample local features around the projected trajectory points using deformable convolution on the latent representation.
- Fuse local context, query features, and uncertainty measures using an MLP.
- Generate residual trajectory updates, which are applied to refine the trajectory.
This mechanism adaptively integrates both global and local context, leveraging perceptual hints from world-state latents to yield smooth, spatially-consistent driving trajectories.
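The refinement loop above can be sketched as repeated residual correction of waypoints using locally sampled latent features. Nearest-cell lookup stands in for the deformable sampling, and the linear residual head, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_local_features(latent_map, points):
    # Nearest-cell lookup as a crude stand-in for deformable sampling
    h, w, _ = latent_map.shape
    idx = np.clip(points.astype(int), 0, [h - 1, w - 1])
    return latent_map[idx[:, 0], idx[:, 1]]

def refine_trajectory(traj, latent_map, n_iters=3, step=0.1):
    # Hypothetical linear residual head mapping local context to (dx, dy)
    W = rng.standard_normal((2, latent_map.shape[-1])) * 0.01
    for _ in range(n_iters):
        local = sample_local_features(latent_map, traj)  # local context per waypoint
        residual = local @ W.T                           # predicted per-point correction
        traj = traj + step * residual                    # apply residual update
    return traj

latent_map = rng.standard_normal((8, 8, 6))  # stand-in for the latent world state
traj0 = np.array([[1.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
traj_refined = refine_trajectory(traj0, latent_map)
```

The key design point survives the simplification: corrections are conditioned on features sampled *around the current trajectory*, so each iteration sees increasingly relevant local context.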
4. Reinforcement Fine-Tuning with Group Relative Policy Optimization
Safety-critical driving policies are enhanced by fine-tuning with GRPO, an RL method tailored for sample efficiency and collision reduction.
Gaussianized Trajectory Policy
The output trajectory policy is parameterized as a multivariate Gaussian, with the mean predicted by the planning network and the covariance produced by a dedicated variance head.
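A diagonal-Gaussian version of such a policy can be sketched as below: the planner's trajectory is the mean, a variance head supplies per-coordinate spread, and sampling plus log-probabilities enable RL updates. Shapes and the diagonal-covariance simplification are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_trajectory(mean, log_std, n_samples):
    # Gaussian policy: mean from the planner, diagonal covariance from a variance head
    std = np.exp(log_std)
    noise = rng.standard_normal((n_samples,) + mean.shape)
    return mean + noise * std

def log_prob(traj, mean, log_std):
    # Diagonal-Gaussian log-likelihood, summed over waypoints and axes
    std = np.exp(log_std)
    z = (traj - mean) / std
    return np.sum(-0.5 * z ** 2 - log_std - 0.5 * np.log(2 * np.pi), axis=(-2, -1))

mean = np.zeros((6, 2))           # 6 waypoints x (x, y), hypothetical
log_std = np.full((6, 2), -1.0)   # variance-head output (log standard deviation)
samples = sample_trajectory(mean, log_std, n_samples=8)
logps = log_prob(samples, mean, log_std)
```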
Collision-Aware Reward
At each trajectory step, a negative reward is assigned for a collision event and $0$ otherwise, focusing optimization on safety-critical behaviors.
Group-Relative Normalization
Within minibatches of trajectories, rewards are group-normalized and advantages are computed over future steps, standardizing the learning signal across batch diversity.
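A sketch of group-relative normalization: compute reward-to-go over future steps for each trajectory in a sampled group, then standardize per timestep across the group. The group size, horizon, and exact normalization granularity are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: (group_size, T) per-step rewards for a group of sampled trajectories
    returns = rewards[:, ::-1].cumsum(axis=1)[:, ::-1]  # reward-to-go over future steps
    mean, std = returns.mean(axis=0), returns.std(axis=0)
    return (returns - mean) / (std + eps)               # standardize within the group

# Three sampled trajectories, three steps; -1 marks a collision step
rewards = np.array([[ 0.0, -1.0, 0.0],
                    [ 0.0,  0.0, 0.0],
                    [-1.0,  0.0, 0.0]])
adv = group_relative_advantages(rewards)
```

Because advantages are measured relative to the group rather than a learned value function, no critic is required, which is the property GRPO exploits.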
GRPO Objective
The RL objective employs clipped policy ratios and a KL-penalty to a reference distribution, encouraging stable yet responsive policy updates. The final loss combines RL with a scaled KL penalty.
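The objective can be sketched in the standard clipped-surrogate form with an added KL penalty; the clip range `eps` and KL weight `beta` below are illustrative values, not the paper's hyperparameters.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, eps=0.2, beta=0.01):
    # Clipped-ratio surrogate with a KL penalty to a reference policy
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean() + beta * kl_to_ref

logp_new = np.array([-1.0, -2.0, -1.5])   # log-probs under the updated policy
logp_old = np.array([-1.1, -1.9, -1.5])   # log-probs under the sampling policy
advantages = np.array([0.5, -0.3, 0.8])   # group-relative advantages
loss = grpo_loss(logp_new, logp_old, advantages, kl_to_ref=0.05)
```

Clipping bounds how far a single update can move the policy, while the KL term anchors it to the pre-trained reference, which is what keeps fine-tuning stable.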
5. Empirical Evaluation and Benchmarks
WorldRFT’s effectiveness is validated on industry-standard benchmarks.
nuScenes (Open-Loop)
| Model | Avg. L2 (m) | Collision Rate (%) |
|---|---|---|
| LAW | 0.61 | 0.30 |
| WorldRFT | 0.48 | 0.05 |
Ablation studies isolate the effect of RL fine-tuning: without RFT, the collision rate is 0.15%; with RFT, it is 0.05%. Overall, this constitutes an 83% reduction in collision rate and a 21% reduction in trajectory error relative to the LAW baseline.
NavSim (Closed-Loop)
Performance is measured using the PDMS metric, integrating no-fault collision rate, drivable-area compliance, ego-progress, time-to-collision, and ride comfort.
| Model | PDMS Score |
|---|---|
| LAW | 84.6 |
| WorldRFT (camera) | 87.8 |
| DiffusionDrive (LiDAR) | 88.1 |
WorldRFT, using only camera inputs, matches the LiDAR-based DiffusionDrive within a 0.3 margin, and outperforms preceding latent world models.
6. Significance and Context
WorldRFT establishes a new direction for structured latent world modeling in end-to-end planning, prioritizing planning objectives in representation learning and introducing a practical, safety-focused reinforcement fine-tuning regimen. The integration of VGGT for spatial awareness, explicit planning decomposition, and iterative, local-aware refinement distinguishes it from previous approaches, which have been limited by perceptually-biased representation learning. The method's ability to achieve state-of-the-art collision avoidance with purely vision-based inputs suggests a path forward for scalable, annotation-free autonomous driving systems (Yang et al., 22 Dec 2025).
A plausible implication is that WorldRFT’s modular decomposition and RL-based fine-tuning principles could generalize to other embodied agent domains. Related research, such as VLA-RFT, extends world-model-based RFT to vision-language-action agents in simulation environments, further validating the centrality of world-model-guided RL across domains (Li et al., 1 Oct 2025).