WorldRFT: Autonomous Driving World Model
- WorldRFT is a latent world model for autonomous driving that integrates vision-geometry fusion, hierarchical planning, and reinforcement fine-tuning.
- It applies a local-aware, iterative refinement process to enhance trajectory precision and achieve significant collision rate reductions.
- Reinforcement fine-tuning with Group Relative Policy Optimization drives state-of-the-art performance on benchmarks like nuScenes and NavSim.
WorldRFT (World Reinforcement Fine-Tuning) is a planning-oriented latent world model framework for end-to-end autonomous driving that explicitly aligns representation learning with planning objectives. It is designed to overcome the limitations of reconstruction-centric latent world models, which often entangle perception and planning and thereby inhibit optimal policy development for safety-critical driving. WorldRFT achieves this by employing a vision-geometry foundation model for 3D spatial awareness, hierarchical planning decomposition, local-aware iterative refinement, and reinforcement learning fine-tuning with Group Relative Policy Optimization (GRPO). The system demonstrates state-of-the-art performance on both open-loop nuScenes and closed-loop NavSim benchmarks, achieving substantial improvements in trajectory accuracy and collision rate reduction (Yang et al., 22 Dec 2025).
1. Latent World Model Architecture
WorldRFT’s core is a latent world model that requires no perception annotations; it is constructed through temporal self-supervised learning and optimized for planning relevance.
Visual–Geometric Fusion
The input comprises surround-view camera images, which are encoded into 2D visual features. A frozen Vision-Geometry Foundation Transformer (VGGT) yields multi-view consistent 3D tokens. Cross-attention between the 2D visual features and the 3D tokens produces the spatial-aware latent world state.
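The fusion step can be sketched as a single cross-attention pass in which 2D visual features query the 3D geometry tokens. This is a minimal numpy illustration of the mechanism only: the learned query/key/value projections, multi-head structure, and tensor shapes of the actual model are omitted, and all dimensions below are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention: queries attend over keys_values.
    # queries: (Nq, d), keys_values: (Nk, d) -> output: (Nq, d)
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values

rng = np.random.default_rng(0)
d = 8
visual_2d = rng.standard_normal((16, d))   # stand-in for per-patch 2D visual features
tokens_3d = rng.standard_normal((32, d))   # stand-in for VGGT multi-view 3D tokens

# Spatial-aware latent world state: 2D features enriched with 3D geometry context
latent_state = cross_attention(visual_2d, tokens_3d, d)
```

The output retains one latent vector per visual query, now mixed with geometry information weighted by feature similarity.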
World Transition
A compact world decoder models temporal dynamics by predicting the next latent state, conditioned on the current latent state and the planned trajectory, and is trained with a mean squared error reconstruction loss. This self-supervised transition mechanism ensures temporal consistency within the latent manifold.
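As a minimal sketch, the transition can be viewed as a function of the concatenated latent state and trajectory, scored against the observed next latent with MSE. The linear decoder and the dimensions below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_state, d_traj = 8, 4

# Hypothetical linear world decoder: z_next ≈ W @ [z_t ; traj_t]
W = rng.standard_normal((d_state, d_state + d_traj)) * 0.1

def predict_next_state(z_t, traj_t):
    return W @ np.concatenate([z_t, traj_t])

def transition_loss(z_t, traj_t, z_next):
    # Mean squared error between predicted and observed next latent state
    err = predict_next_state(z_t, traj_t) - z_next
    return float(np.mean(err ** 2))

z_t = rng.standard_normal(d_state)
traj_t = rng.standard_normal(d_traj)
z_next = rng.standard_normal(d_state)
loss = transition_loss(z_t, traj_t, z_next)
```

Minimizing this loss over temporal pairs is what enforces consistency of the latent dynamics without any perception labels.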
Self-Supervised and Imitation Losses
Semantic alignment is facilitated through cross-entropy on pseudo-masks from a vision–LLM (Grounded-SAM). Hierarchical planning supervision employs Laplace negative log-likelihood for target region prediction and L1 losses for trajectory and path supervision. The overall pre-training objective is a weighted sum of semantic, reconstruction, target, and trajectory losses.
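The loss composition described above can be sketched as follows. The Laplace NLL and L1 terms follow their standard definitions; the function names and unit weights are assumptions for illustration, not the paper's actual coefficients.

```python
import numpy as np

def laplace_nll(mu, b, x):
    # Standard Laplace negative log-likelihood: log(2b) + |x - mu| / b
    return float(np.mean(np.log(2.0 * b) + np.abs(x - mu) / b))

def l1_loss(pred, target):
    return float(np.mean(np.abs(pred - target)))

def pretraining_loss(l_sem, l_rec, l_tgt, l_trj,
                     w_sem=1.0, w_rec=1.0, w_tgt=1.0, w_trj=1.0):
    # Weighted sum of semantic, reconstruction, target, and trajectory terms
    return w_sem * l_sem + w_rec * l_rec + w_tgt * l_tgt + w_trj * l_trj

mu, b = np.zeros(2), np.ones(2)
total = pretraining_loss(
    l_sem=0.7,                                    # e.g. cross-entropy on pseudo-masks
    l_rec=0.4,                                    # latent transition MSE
    l_tgt=laplace_nll(mu, b, np.array([0.5, -0.5])),  # target region NLL
    l_trj=l1_loss(np.zeros(4), np.ones(4)),       # trajectory / path L1
)
```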
2. Hierarchical Planning Decomposition
WorldRFT decomposes planning into three hierarchical sub-tasks:
Task Definition
- High-level: Target region localization captures terminal goal uncertainty via a Laplace (location, scale) parameterization.
- Mid-level: Spatial path planning produces a sequence of spatial waypoints.
- Low-level: Temporal trajectory planning yields timestamped trajectory points.
Query-Based Feature Extraction
Hierarchical queries for the target, path, and trajectory sub-tasks are fused with the latent world state through cross-attention and self-attention, resulting in task-specific feature vectors.
Prediction Heads
Multi-layer perceptrons generate target region estimates, paths, and trajectories from the attended features, supporting structured planning grounded in latent state representation.
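A compact sketch of the three prediction heads: each is a small MLP applied to its attended task feature. The two-layer form, output counts (one target region, five waypoints, six timestamped points), and dimensions are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

def mlp_head(x, W1, W2):
    # Two-layer perceptron with ReLU: W2 @ relu(W1 @ x)
    return W2 @ np.maximum(W1 @ x, 0.0)

feat = rng.standard_normal(d)  # attended task-specific feature vector

# Hypothetical head shapes:
W1t, W2t = rng.standard_normal((16, d)), rng.standard_normal((4, 16))   # target: (mu_x, mu_y, b_x, b_y)
W1p, W2p = rng.standard_normal((16, d)), rng.standard_normal((10, 16))  # 5 spatial waypoints (x, y)
W1j, W2j = rng.standard_normal((16, d)), rng.standard_normal((12, 16))  # 6 timestamped points (x, y)

target = mlp_head(feat, W1t, W2t)
path = mlp_head(feat, W1p, W2p).reshape(5, 2)
traj = mlp_head(feat, W1j, W2j).reshape(6, 2)
```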
3. Local-Aware Iterative Refinement
WorldRFT utilizes a local-aware, multi-step refinement process to improve the precision of its planning outputs.
- Initialization: Initial predictions of the target region, path, and trajectory are generated.
- Iterative Update: For each refinement iteration:
- Encode the global plan state.
- Project trajectory points into camera space.
- Sample local features around the projected trajectory points using deformable convolution on the latent representation.
- Fuse local context, query features, and uncertainty measures using an MLP.
- Generate residual trajectory updates, which are applied to refine the trajectory.
This mechanism adaptively integrates both global and local context, leveraging perceptual hints from world-state latents to yield smooth, spatially-consistent driving trajectories.
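The refinement loop above can be sketched as repeated residual correction of waypoints using locally sampled latent features. Nearest-cell lookup stands in for the deformable sampling, and the linear residual head, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_local_features(latent_map, points):
    # Nearest-cell lookup as a crude stand-in for deformable sampling
    h, w, _ = latent_map.shape
    idx = np.clip(points.astype(int), 0, [h - 1, w - 1])
    return latent_map[idx[:, 0], idx[:, 1]]

def refine_trajectory(traj, latent_map, n_iters=3, step=0.1):
    # Hypothetical linear residual head mapping local context to (dx, dy)
    W = rng.standard_normal((2, latent_map.shape[-1])) * 0.01
    for _ in range(n_iters):
        local = sample_local_features(latent_map, traj)  # local context per waypoint
        residual = local @ W.T                           # predicted per-point correction
        traj = traj + step * residual                    # apply residual update
    return traj

latent_map = rng.standard_normal((8, 8, 6))  # stand-in for the latent world state
traj0 = np.array([[1.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
traj_refined = refine_trajectory(traj0, latent_map)
```

The key design point survives the simplification: corrections are conditioned on features sampled *around the current trajectory*, so each iteration sees increasingly relevant local context.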
4. Reinforcement Fine-Tuning with Group Relative Policy Optimization
Safety-critical driving policies are enhanced by fine-tuning with GRPO, an RL method tailored for sample efficiency and collision reduction.
Gaussianized Trajectory Policy
The output trajectory policy is parameterized as a multivariate Gaussian, with the mean predicted by the planning network and the covariance produced by a dedicated variance head.
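A diagonal-Gaussian version of such a policy can be sketched as below: the planner's trajectory is the mean, a variance head supplies per-coordinate spread, and sampling plus log-probabilities enable RL updates. Shapes and the diagonal-covariance simplification are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_trajectory(mean, log_std, n_samples):
    # Gaussian policy: mean from the planner, diagonal covariance from a variance head
    std = np.exp(log_std)
    noise = rng.standard_normal((n_samples,) + mean.shape)
    return mean + noise * std

def log_prob(traj, mean, log_std):
    # Diagonal-Gaussian log-likelihood, summed over waypoints and axes
    std = np.exp(log_std)
    z = (traj - mean) / std
    return np.sum(-0.5 * z ** 2 - log_std - 0.5 * np.log(2 * np.pi), axis=(-2, -1))

mean = np.zeros((6, 2))           # 6 waypoints x (x, y), hypothetical
log_std = np.full((6, 2), -1.0)   # variance-head output (log standard deviation)
samples = sample_trajectory(mean, log_std, n_samples=8)
logps = log_prob(samples, mean, log_std)
```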
Collision-Aware Reward
At each trajectory step, a negative reward is assigned for a collision event and $0$ otherwise, focusing optimization on safety-critical behaviors.
Group-Relative Normalization
Within minibatches of trajectories, rewards are group-normalized and advantages are computed over future steps, standardizing the learning signal across batch diversity.
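A sketch of group-relative normalization: compute reward-to-go over future steps for each trajectory in a sampled group, then standardize per timestep across the group. The group size, horizon, and exact normalization granularity are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: (group_size, T) per-step rewards for a group of sampled trajectories
    returns = rewards[:, ::-1].cumsum(axis=1)[:, ::-1]  # reward-to-go over future steps
    mean, std = returns.mean(axis=0), returns.std(axis=0)
    return (returns - mean) / (std + eps)               # standardize within the group

# Three sampled trajectories, three steps; -1 marks a collision step
rewards = np.array([[ 0.0, -1.0, 0.0],
                    [ 0.0,  0.0, 0.0],
                    [-1.0,  0.0, 0.0]])
adv = group_relative_advantages(rewards)
```

Because advantages are measured relative to the group rather than a learned value function, no critic is required, which is the property GRPO exploits.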
GRPO Objective
The RL objective employs clipped policy ratios and a KL-penalty to a reference distribution, encouraging stable yet responsive policy updates. The final loss combines RL with a scaled KL penalty.
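The objective can be sketched in the standard clipped-surrogate form with an added KL penalty; the clip range `eps` and KL weight `beta` below are illustrative values, not the paper's hyperparameters.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, eps=0.2, beta=0.01):
    # Clipped-ratio surrogate with a KL penalty to a reference policy
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean() + beta * kl_to_ref

logp_new = np.array([-1.0, -2.0, -1.5])   # log-probs under the updated policy
logp_old = np.array([-1.1, -1.9, -1.5])   # log-probs under the sampling policy
advantages = np.array([0.5, -0.3, 0.8])   # group-relative advantages
loss = grpo_loss(logp_new, logp_old, advantages, kl_to_ref=0.05)
```

Clipping bounds how far a single update can move the policy, while the KL term anchors it to the pre-trained reference, which is what keeps fine-tuning stable.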
5. Empirical Evaluation and Benchmarks
WorldRFT’s effectiveness is validated on industry-standard benchmarks.
nuScenes (Open-Loop)
| Model | Avg. L2 (m) | Collision Rate (%) |
|---|---|---|
| LAW | 0.61 | 0.30 |
| WorldRFT | 0.48 | 0.05 |
Ablation studies isolate the effect of RL fine-tuning: without RFT, the collision rate is 0.15%; with RFT, it is 0.05%. Overall, this constitutes an 83% reduction in collision rate and a 21% reduction in trajectory error relative to the LAW baseline.
NavSim (Closed-Loop)
Performance is measured using the PDMS metric, integrating no-fault collision rate, drivable-area compliance, ego-progress, time-to-collision, and ride comfort.
| Model | PDMS Score |
|---|---|
| LAW | 84.6 |
| WorldRFT (camera) | 87.8 |
| DiffusionDrive (LiDAR) | 88.1 |
WorldRFT, using only camera inputs, matches the LiDAR-based DiffusionDrive within a 0.3 margin, and outperforms preceding latent world models.
6. Significance and Context
WorldRFT establishes a new direction for structured latent world modeling in end-to-end planning, prioritizing planning objectives in representation learning and introducing a practical, safety-focused reinforcement fine-tuning regimen. The integration of VGGT for spatial awareness, explicit planning decomposition, and iterative, local-aware refinement distinguishes it from previous approaches, which have been limited by perceptually-biased representation learning. The method's ability to achieve state-of-the-art collision avoidance with purely vision-based inputs suggests a path forward for scalable, annotation-free autonomous driving systems (Yang et al., 22 Dec 2025).
A plausible implication is that WorldRFT’s modular decomposition and RL-based fine-tuning principles could generalize to other embodied agent domains. Related research, such as VLA-RFT, extends world-model-based RFT to vision-language-action agents in simulation environments, further validating the centrality of world-model-guided RL across domains (Li et al., 1 Oct 2025).