- The paper introduces a unified VLA framework that combines dense RGB and depth predictions with trajectory forecasting to internalize detailed scene dynamics.
- The method leverages an intrinsic exploration reward based on world model uncertainty to safely identify and promote novel, out-of-distribution trajectories via GRPO.
- Experimental results show state-of-the-art performance on benchmarks, achieving high safety and planning metrics with a single-camera setup.
ExploreVLA: Dense World Modeling and Exploration for Autonomous Driving
Motivation and Contributions
ExploreVLA proposes a unified Vision-Language-Action (VLA) architecture targeting key deficiencies in end-to-end autonomous driving. Existing imitation learning paradigms, which dominate VLA-based models, fail to generalize under distributional shift because they cannot discover alternative or novel driving strategies not observed in the expert data. Reinforcement Learning (RL) offers policy exploration, but typical offline RL settings in autonomous driving lack direct access to environment state transitions, necessitating a learned world model. Moreover, most VLA architectures suffer from sparse supervision, relying mainly on trajectory waypoints and textual commands, which limits the richness of scene understanding.
ExploreVLA addresses these limitations through two core mechanisms:
- Dense Supervision via World Modeling: Joint prediction of future RGB and depth images, alongside trajectory prediction, forces the model to internalize detailed visual and geometric information, enhancing planning fidelity.
- Intrinsic Exploration Reward: The world model's uncertainty in generating future images, conditioned on candidate trajectories, serves as an intrinsic reward signal that quantifies trajectory novelty and facilitates out-of-distribution exploration. This reward is safety-gated using the Predictive Driver Model Score (PDMS) to ensure only safe, non-colliding, novel trajectories are encouraged.
These contributions enable a unified framework for understanding and generation, and an RL-based optimization that expands behavioral repertoire beyond imitation. The composite reward is optimized using Group Relative Policy Optimization (GRPO), which ranks candidate trajectories within sampled groups and promotes both quality and novelty.
Architectural Design
The backbone is a transformer-based unified VLA architecture capable of multimodal tokenization and reasoning. Text and ego-status are processed via causal attention, while images utilize full attention. The output space combines:
- Planned future waypoints (continuous)
- Generated RGB images (discrete tokens)
- Generated depth maps (discrete tokens)
Image tokenization leverages MAGVIT-v2 quantization, yielding a scalable, efficient representation compatible with LLM token processing. Trajectory prediction is decoupled from the token vocabulary via a dedicated MLP, allowing precise action inference while maintaining shared contextual embeddings.
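The split between discrete visual tokens and a continuous waypoint head can be illustrated with a minimal numpy sketch. All sizes here (hidden width, codebook size, horizon) are hypothetical placeholders, not values from the paper; the point is only that both heads read the same shared embedding while the waypoint MLP bypasses the token vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64      # hypothetical shared embedding width
VOCAB = 512      # hypothetical MAGVIT-v2 codebook size
WAYPOINTS = 8    # hypothetical planning horizon (x, y per step)

# Shared contextual embedding produced by the transformer backbone.
h = rng.standard_normal(HIDDEN)

# Discrete head: logits over the visual token vocabulary (RGB/depth tokens).
W_tok = rng.standard_normal((VOCAB, HIDDEN)) * 0.02
token_logits = W_tok @ h

# Decoupled continuous head: a small MLP regresses waypoints directly,
# bypassing the token vocabulary for precise action inference.
W1 = rng.standard_normal((128, HIDDEN)) * 0.02
W2 = rng.standard_normal((WAYPOINTS * 2, 128)) * 0.02
waypoints = (W2 @ np.maximum(W1 @ h, 0.0)).reshape(WAYPOINTS, 2)

print(token_logits.shape, waypoints.shape)  # (512,) (8, 2)
```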
Dense World Modeling Objectives
RGB and depth image generation objectives provide dense pixel-level supervision, supplementing the sparse textual and waypoint signals. RGB generation enforces learning of fine-grained appearance, semantics, and object identity. Depth generation encodes spatial layout, metric geometry, and surface orientation essential for collision avoidance and trajectory feasibility. The mask token prediction objective ensures robust generative modeling, forcing the system to reconstruct masked regions, enhancing internalization of scene dynamics and complex spatial relationships.
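The masked token prediction objective described above can be sketched as a cross-entropy loss computed only at masked positions. This is a generic masked-prediction sketch, not the paper's implementation; the sequence length, vocabulary size, and masking ratio are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, SEQ = 16, 10

# Ground-truth visual token ids (e.g. quantized codes for an RGB frame).
targets = rng.integers(0, VOCAB, size=SEQ)

# Randomly mask a subset of positions; the model must reconstruct them.
mask = rng.random(SEQ) < 0.5
mask[0] = True  # guarantee at least one masked position

# Hypothetical model logits for every position.
logits = rng.standard_normal((SEQ, VOCAB))

# Softmax over the vocabulary.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Cross-entropy computed only on masked positions: unmasked tokens
# provide context for reconstruction, not loss.
masked_ce = -np.log(probs[mask, targets[mask]]).mean()
print(masked_ce > 0.0)
```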
Empirical ablation confirms that combining RGB and depth generation improves planning performance, with each modality contributing complementary information.
Exploration-Driven RL Post-Training
ExploreVLA's post-training RL phase utilizes an exploration bonus derived from world model uncertainty (entropy of token prediction distributions) conditioned on sampled trajectories. This uncertainty reflects how far a candidate trajectory is from the expert distribution, thus identifying valuable out-of-distribution candidates.
To guarantee safety, exploration bonuses are gated by PDMS scores; only trajectories exceeding a threshold PDMS receive the bonus. The composite reward R_i = PDMS_i + λ·b_i (with bonus b_i for high-uncertainty, high-safety trajectories) is then optimized via GRPO. Sampling groups of candidate trajectories and scoring them relative to each other enables robust policy improvement while balancing exploration and exploitation.
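The gated reward and group-relative scoring can be sketched as follows. The bonus weight, the PDMS threshold, and the standardized-advantage form are illustrative assumptions (GRPO implementations typically normalize rewards within each sampled group, but the paper's exact formulation may differ).

```python
import numpy as np

LAMBDA = 0.1   # hypothetical bonus weight
TAU = 0.8      # hypothetical PDMS safety threshold for gating

def entropy(p):
    """Shannon entropy of a token prediction distribution."""
    p = np.asarray(p)
    return float(-(p * np.log(p + 1e-12)).sum())

def composite_rewards(pdms, token_dists):
    # Bonus b_i: world-model uncertainty (mean token entropy), gated so
    # that only safe trajectories (PDMS_i >= TAU) are rewarded for novelty.
    b = np.array([np.mean([entropy(d) for d in dists]) for dists in token_dists])
    gated = np.where(pdms >= TAU, b, 0.0)
    return pdms + LAMBDA * gated   # R_i = PDMS_i + lambda * b_i

def grpo_advantages(rewards):
    # Group-relative scoring: standardize rewards within the sampled group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy group of 4 candidate trajectories with one token distribution each.
pdms = np.array([0.9, 0.85, 0.6, 0.95])
dists = [[[0.25] * 4],                    # high entropy, safe -> bonus
         [[0.7, 0.1, 0.1, 0.1]],
         [[0.25] * 4],                    # high entropy, unsafe -> no bonus
         [[0.97, 0.01, 0.01, 0.01]]]
R = composite_rewards(pdms, dists)
A = grpo_advantages(R)
print(R.round(3), A.round(3))
```

Note how the third trajectory, despite high world-model entropy, receives no bonus because its PDMS falls below the gate, matching the safety constraint described above.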
Experimental Results
ExploreVLA demonstrates state-of-the-art performance on NAVSIM v1 and v2 benchmarks, achieving PDMS of 93.7 and EPDMS of 88.8, respectively. Notably, it surpasses multi-sensor systems (including those using LiDAR and multiple cameras) while utilizing only a single front-view camera. It also outperforms prior dense world modeling approaches and VLA models that lack explicit exploration mechanisms.
Submetric analyses show consistently high scores on safety, comfort, collision avoidance, and lane compliance. The qualitative results reveal successful correction of safety-critical failures in challenging scenarios after the RL post-training phase, substantiating the practical effectiveness of the exploration-driven optimization.
On nuScenes, ExploreVLA yields competitive L2 errors and the lowest average collision rate among baselines under the ST-P3 protocol, confirming robust open-loop planning generalization.
Implications and Future Directions
This work advances the field by integrating dense world modeling into the supervisory pipeline and introducing image-based uncertainty as a principled exploration signal in RL post-training. This combination improves out-of-distribution generalization, safety, and planning performance.
Practically, the approach enables single-camera systems to match or outperform multimodal inputs, reducing hardware complexity and cost. Theoretically, it sets a precedent for leveraging generative uncertainty as intrinsic reward in RL for structured domains, providing a pathway for scalable, data-driven exploration in safety-critical applications.
Future extensions include expanding to multi-view camera setups for enhanced spatial coverage and diversified supervision (potentially incorporating BEV and semantic maps), and exploring closed-loop evaluation settings, which remain an open challenge. Further, integration of more advanced uncertainty quantification and exploration strategiesโsuch as diffusion-based world modeling or intention-aware generationโcould facilitate even more robust long-tail behavior.
Conclusion
ExploreVLA establishes a unified understanding-and-generation-driven paradigm for end-to-end autonomous driving, combining dense world modeling supervision with uncertainty-gated exploration rewards. With strong empirical performance and theoretical novelty, it positions world model-based RL as a critical technique for scalable, robust autonomous policy optimization, promising continued advancements as data diversity and architectural complexity grow (2604.02714).