
WorldRFT: Autonomous Driving World Model

Updated 29 December 2025
  • WorldRFT is a latent world model for autonomous driving that integrates vision-geometry fusion, hierarchical planning, and reinforcement fine-tuning.
  • It applies a local-aware, iterative refinement process to enhance trajectory precision and achieve significant collision rate reductions.
  • Reinforcement fine-tuning with Group Relative Policy Optimization drives state-of-the-art performance on benchmarks like nuScenes and NavSim.

WorldRFT (World Reinforcement Fine-Tuning) is a planning-oriented latent world model framework for end-to-end autonomous driving that explicitly aligns representation learning with planning objectives. It is designed to overcome the limitations of reconstruction-centric latent world models, which often entangle perception and planning and thereby inhibit optimal policy development for safety-critical driving. WorldRFT achieves this by employing a vision-geometry foundation model for 3D spatial awareness, hierarchical planning decomposition, local-aware iterative refinement, and reinforcement learning fine-tuning with Group Relative Policy Optimization (GRPO). The system demonstrates state-of-the-art performance on both open-loop nuScenes and closed-loop NavSim benchmarks, achieving substantial improvements in trajectory accuracy and collision rate reduction (Yang et al., 22 Dec 2025).

1. Latent World Model Architecture

WorldRFT’s core is a perception annotation-free latent world model, constructed through temporal self-supervised learning and optimized for planning relevance.

Visual–Geometric Fusion

The input comprises surround-view images $I_t \in \mathbb{R}^{M\times H\times W\times 3}$. These are processed into 2D visual features $F_t = \mathrm{Backbone}(I_t) \in \mathbb{R}^{M\times h\times w\times D}$. A frozen Vision-Geometry Foundation Transformer (VGGT) yields multi-view consistent 3D tokens $t_{3D}$. Cross-attention between $F_t$ and $t_{3D}$ produces the spatial-aware latent world state $W^t_\mathrm{latent}$.
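As a rough sketch of this fusion step, the following single-head cross-attention lets flattened visual features attend to 3D tokens. Random weights stand in for learned projections, and all shapes and dimensions here are illustrative, not the paper's:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    """Single-head cross-attention: queries attend to keys_values."""
    rng = np.random.default_rng(0)
    # Random matrices stand in for learned Q/K/V projections.
    Wq = rng.standard_normal((queries.shape[-1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((keys_values.shape[-1], d)) / np.sqrt(d)
    Wv = rng.standard_normal((keys_values.shape[-1], d)) / np.sqrt(d)
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))
    return attn @ V

# 2D visual features F_t (flattened h*w tokens) attend to 3D tokens t_3D.
F_t = np.random.default_rng(1).standard_normal((64, 32))   # h*w x D
t_3d = np.random.default_rng(2).standard_normal((16, 32))  # tokens from VGGT
W_latent = cross_attention(F_t, t_3d, d=32)                # spatial-aware state
```

The output keeps the shape of the visual-feature queries while injecting geometric context from the 3D tokens.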

World Transition

A compact world decoder $\mathcal{T}_\phi$ models temporal dynamics by predicting the next latent state $\widehat{W}^{t+1}_\mathrm{latent}$, conditioned on the current latent state and trajectory, and is trained with a mean squared error reconstruction loss. This self-supervised transition mechanism enforces temporal consistency within the latent manifold.
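A minimal sketch of this transition, with a toy linear decoder standing in for $\mathcal{T}_\phi$ and arbitrary small dimensions:

```python
import numpy as np

def transition(w_latent, trajectory, W_phi):
    """Toy linear world decoder: predict the next latent state from the
    current latent state concatenated with the trajectory (action)."""
    x = np.concatenate([w_latent, trajectory])
    return W_phi @ x

rng = np.random.default_rng(0)
D, A = 8, 2                                    # latent dim, trajectory dim
W_phi = 0.1 * rng.standard_normal((D, D + A))  # stand-in for learned decoder
w_t = rng.standard_normal(D)                   # current latent state
traj_t = rng.standard_normal(A)                # current trajectory input
w_next_true = rng.standard_normal(D)           # next-step latent target

w_next_pred = transition(w_t, traj_t, W_phi)
mse = np.mean((w_next_pred - w_next_true) ** 2)  # reconstruction loss on latents
```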

Self-Supervised and Imitation Losses

Semantic alignment is supervised with cross-entropy on pseudo-masks produced by Grounded-SAM, an open-vocabulary segmentation pipeline. Hierarchical planning supervision employs a Laplace negative log-likelihood for target-region prediction and L1 losses for path and trajectory supervision. The overall pre-training objective is a weighted sum of the semantic, reconstruction, target, and trajectory losses.
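The Laplace negative log-likelihood and the weighted loss combination can be written out directly; the weights below are hypothetical placeholders, not the paper's values:

```python
import numpy as np

def laplace_nll(target, mu, b):
    """Negative log-likelihood of Laplace(mu, b),
    as used for target-region supervision."""
    return np.log(2.0 * b) + np.abs(target - mu) / b

def pretrain_loss(l_sem, l_rec, l_target, l_traj,
                  w=(1.0, 1.0, 0.5, 1.0)):  # hypothetical weights
    """Weighted sum of the semantic, reconstruction,
    target, and trajectory losses."""
    return w[0] * l_sem + w[1] * l_rec + w[2] * l_target + w[3] * l_traj
```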

2. Hierarchical Planning Decomposition

WorldRFT decomposes planning into three hierarchical sub-tasks:

Task Definition

  • High-level: Target region localization captures terminal goal uncertainty via a Laplace $(\mu, b)$ parameterization.
  • Mid-level: Spatial path planning produces a sequence of spatial waypoints $T_\mathrm{path} \in \mathbb{R}^{N\times 2}$.
  • Low-level: Temporal trajectory planning yields timestamped points $T_\mathrm{traj} \in \mathbb{R}^{T\times 2}$.
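The three output levels can be summarized as a simple container; the field names and shapes below are illustrative, not from the paper:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PlanOutput:
    target_mu: np.ndarray    # (2,) Laplace location of the goal region
    target_b: np.ndarray     # (2,) Laplace scale (goal uncertainty)
    path: np.ndarray         # (N, 2) spatial waypoints, no timing
    trajectory: np.ndarray   # (T, 2) timestamped trajectory points

plan = PlanOutput(
    target_mu=np.zeros(2),
    target_b=np.ones(2),
    path=np.zeros((6, 2)),       # N = 6 waypoints
    trajectory=np.zeros((8, 2)), # T = 8 timesteps
)
```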

Query-Based Feature Extraction

Hierarchical queries ($Q_\mathrm{target}$, $Q_\mathrm{path}$, $Q_\mathrm{traj}$) are fused with the latent world state through cross-attention and self-attention, resulting in task-specific feature vectors $Q''$.

Prediction Heads

Multi-layer perceptrons generate target region estimates, paths, and trajectories from the attended features, supporting structured planning grounded in latent state representation.
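A condensed sketch of the query pipeline for one task (trajectory): cross-attention with the latent world state, self-attention among queries to form $Q''$, then an MLP head. Learned projections are omitted and all weights are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv, scale):
    """Simplified attention with keys == values and no projections."""
    return softmax(q @ kv.T * scale) @ kv

def mlp_head(x, W1, W2):
    """Two-layer perceptron prediction head."""
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
D = 16
W_latent = rng.standard_normal((64, D))       # latent world state tokens
Q_traj = rng.standard_normal((8, D))          # trajectory queries (T = 8)

Q1 = attend(Q_traj, W_latent, 1 / np.sqrt(D)) # cross-attention with world state
Q2 = attend(Q1, Q1, 1 / np.sqrt(D))           # self-attention -> Q''
W1 = rng.standard_normal((D, 32))
W2 = rng.standard_normal((32, 2))
T_traj = mlp_head(Q2, W1, W2)                 # (8, 2) trajectory points
```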

3. Local-Aware Iterative Refinement

WorldRFT utilizes a local-aware, multi-step refinement process to improve the precision of its planning outputs.

  • Initialization: Initial predictions of $(\mu, b)$, $T_\mathrm{path}$, and $T_\mathrm{traj}$ are generated.
  • Iterative Update: For each of $K$ iterations:

    1. Encode the global plan state.
    2. Project trajectory points into camera space.
    3. Sample local features around the projected trajectory points using deformable convolution on the latent representation.
    4. Fuse local context, query features, and uncertainty measures using an MLP.
    5. Generate residual trajectory updates, which are applied to refine TtrajT_\mathrm{traj}.

This mechanism adaptively integrates both global and local context, leveraging perceptual hints from world-state latents to yield smooth, spatially-consistent driving trajectories.
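The refinement loop above can be sketched as repeated residual updates. Here the gap to an anchor point stands in for the fused local context an MLP would actually produce, and the step count and scale are arbitrary:

```python
import numpy as np

def refine(traj, anchor, K=3):
    """K residual-update steps. The gap to an anchor stands in for the
    sampled local features and fused context of the real model."""
    for _ in range(K):
        residual = 0.5 * (anchor - traj)  # hypothetical residual head output
        traj = traj + residual            # apply residual update to T_traj
    return traj

# Trajectory of 4 points pulled toward an anchor over K = 3 iterations.
traj = refine(np.zeros((4, 2)), np.ones((4, 2)))
```

Each pass shrinks the remaining gap multiplicatively, mirroring how successive residuals progressively sharpen the trajectory.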

4. Reinforcement Fine-Tuning with Group Relative Policy Optimization

Safety-critical driving policies are enhanced by fine-tuning with GRPO, an RL method tailored for sample efficiency and collision reduction.

Gaussianized Trajectory Policy

The output trajectory policy $\pi_\theta$ is parameterized as a multivariate Gaussian, with the mean predicted by the network and the covariance by a variance head.
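A minimal sketch of sampling from such a policy, assuming a diagonal covariance for simplicity; the log-probability is what a policy-ratio objective would later consume:

```python
import numpy as np

def sample_trajectory(mu, sigma, rng):
    """Sample from a diagonal-Gaussian trajectory policy and return the
    sample's log-probability under that policy."""
    action = mu + sigma * rng.standard_normal(mu.shape)
    logp = -0.5 * np.sum(((action - mu) / sigma) ** 2
                         + np.log(2 * np.pi * sigma ** 2))
    return action, logp

mu = np.zeros((8, 2))           # predicted mean trajectory (T = 8 steps)
sigma = 0.1 * np.ones((8, 2))   # scale from the variance head
action, logp = sample_trajectory(mu, sigma, np.random.default_rng(0))
```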

Collision-Aware Reward

At each trajectory step, the reward is $-1$ for a collision event and $0$ otherwise, focusing optimization on safety-critical behaviors.

Group-Relative Normalization

Within minibatches of $G$ trajectories, rewards are group-normalized and advantages are computed over future steps, standardizing the learning signal across the batch.
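Group-relative normalization reduces to standardizing rewards within each group of sampled trajectories. This sketch normalizes terminal rewards only; per-step accumulation over future steps is omitted for brevity:

```python
import numpy as np

def group_relative_advantage(rewards):
    """Standardize rewards within a group of G sampled trajectories
    to form group-relative advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One colliding trajectory in a group of G = 4.
adv = group_relative_advantage([-1.0, 0.0, 0.0, 0.0])
```

The collision receives a strongly negative advantage while collision-free rollouts are mildly rewarded, regardless of the raw reward scale.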

GRPO Objective

The RL objective employs clipped policy ratios and a KL-penalty to a reference distribution, encouraging stable yet responsive policy updates. The final loss combines RL with a scaled KL penalty.
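The objective can be sketched as a standard clipped-ratio surrogate plus a KL term; the clipping range `eps` and penalty weight `beta` below are illustrative hyperparameters, not the paper's:

```python
import numpy as np

def grpo_loss(logp_new, logp_old, adv, kl, eps=0.2, beta=0.01):
    """Clipped policy-ratio surrogate with a KL penalty toward a
    reference policy (a common GRPO-style formulation)."""
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1 - eps, 1 + eps) * adv)
    return -surrogate.mean() + beta * kl
```

When the new and old log-probabilities coincide, the ratio is 1 and the loss reduces to the negative mean advantage plus the KL penalty.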

5. Empirical Evaluation and Benchmarks

WorldRFT’s effectiveness is validated on industry-standard benchmarks.

nuScenes (Open-Loop)

Model      Avg. L2 (m)   Collision Rate (%)
LAW        0.61          0.30
WorldRFT   0.48          0.05

Ablation studies isolate the effect of reinforcement fine-tuning: without RFT, the collision rate is 0.15%; with RFT, it drops to 0.05%. Overall, WorldRFT achieves an 83% reduction in collision rate (0.30% → 0.05%) and a 21% reduction in trajectory error (0.61 m → 0.48 m) relative to LAW.

NavSim (Closed-Loop)

Performance is measured using the PDMS metric, which aggregates no-fault collision rate, drivable-area compliance, ego progress, time-to-collision, and ride comfort.

Model                    PDMS Score
LAW                      84.6
WorldRFT (camera)        87.8
DiffusionDrive (LiDAR)   88.1

WorldRFT, using only camera inputs, comes within 0.3 PDMS of the LiDAR-based DiffusionDrive and outperforms preceding latent world models.

6. Significance and Context

WorldRFT establishes a new direction for structured latent world modeling in end-to-end planning, prioritizing planning objectives in representation learning and introducing a practical, safety-focused reinforcement fine-tuning regimen. The integration of VGGT for spatial awareness, explicit planning decomposition, and iterative, local-aware refinement distinguishes it from previous approaches, which have been limited by perceptually-biased representation learning. The method's ability to achieve state-of-the-art collision avoidance with purely vision-based inputs suggests a path forward for scalable, annotation-free autonomous driving systems (Yang et al., 22 Dec 2025).

A plausible implication is that WorldRFT’s modular decomposition and RL-based fine-tuning principles could generalize to other embodied agent domains. Related research, such as VLA-RFT, extends world-model-based RFT to vision-language-action agents in simulation environments, further validating the centrality of world-model-guided RL across domains (Li et al., 1 Oct 2025).
