- The paper introduces a unified VLA framework that combines dense RGB and depth predictions with trajectory forecasting to internalize detailed scene dynamics.
- The method leverages an intrinsic exploration reward based on world model uncertainty to safely identify and promote novel, out-of-distribution trajectories via GRPO.
- Experimental results show state-of-the-art performance on benchmarks, achieving high safety and planning metrics with a single-camera setup.
ExploreVLA: Dense World Modeling and Exploration for Autonomous Driving
Motivation and Contributions
ExploreVLA proposes a unified Vision-Language-Action (VLA) architecture targeting key deficiencies in end-to-end autonomous driving. Existing imitation learning paradigms, which dominate VLA-based models, fail to generalize under distributional shift because they cannot discover alternative or novel driving strategies not observed in the expert data. Reinforcement Learning (RL) offers policy exploration, but typical offline RL settings in autonomous driving lack direct access to environment state transitions, necessitating a learned world model. Moreover, most VLA architectures suffer from sparse supervision, relying mainly on trajectory waypoints and textual commands, which limits the richness of scene understanding.
ExploreVLA addresses these limitations through two core mechanisms:
- Dense Supervision via World Modeling: Joint prediction of future RGB and depth images, alongside trajectory prediction, forces the model to internalize detailed visual and geometric information, enhancing planning fidelity.
- Intrinsic Exploration Reward: The world model's uncertainty in generating future images, conditioned on candidate trajectories, serves as an intrinsic reward signal that quantifies trajectory novelty and facilitates out-of-distribution exploration. This reward is safety-gated using the Predictive Driver Model Score (PDMS) to ensure only safe, non-colliding, novel trajectories are encouraged.
These contributions enable a unified framework for understanding and generation, and an RL-based optimization that expands behavioral repertoire beyond imitation. The composite reward is optimized using Group Relative Policy Optimization (GRPO), which ranks candidate trajectories within sampled groups and promotes both quality and novelty.
Architectural Design
The backbone is a transformer-based unified VLA architecture capable of multimodal tokenization and reasoning. Text and ego-status are processed via causal attention, while images utilize full attention. The output space combines:
- Planned future waypoints (continuous)
- Generated RGB images (discrete tokens)
- Generated depth maps (discrete tokens)
Image tokenization leverages MAGVIT-v2 quantization, yielding a scalable, efficient representation compatible with LLM token processing. Trajectory prediction is decoupled from the token vocabulary via a dedicated MLP, allowing precise action inference while maintaining shared contextual embeddings.
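The split between discrete visual tokens and a continuous waypoint head can be illustrated with a minimal numpy sketch. All sizes here (hidden width, codebook size, horizon) are hypothetical placeholders, not values from the paper; the point is only that both heads read the same shared embedding while the waypoint MLP bypasses the token vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64      # hypothetical shared embedding width
VOCAB = 512      # hypothetical MAGVIT-v2 codebook size
WAYPOINTS = 8    # hypothetical planning horizon (x, y per step)

# Shared contextual embedding produced by the transformer backbone.
h = rng.standard_normal(HIDDEN)

# Discrete head: logits over the visual token vocabulary (RGB/depth tokens).
W_tok = rng.standard_normal((VOCAB, HIDDEN)) * 0.02
token_logits = W_tok @ h

# Decoupled continuous head: a small MLP regresses waypoints directly,
# bypassing the token vocabulary for precise action inference.
W1 = rng.standard_normal((128, HIDDEN)) * 0.02
W2 = rng.standard_normal((WAYPOINTS * 2, 128)) * 0.02
waypoints = (W2 @ np.maximum(W1 @ h, 0.0)).reshape(WAYPOINTS, 2)

print(token_logits.shape, waypoints.shape)  # (512,) (8, 2)
```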
Dense World Modeling Objectives
RGB and depth image generation objectives provide dense pixel-level supervision, supplementing the sparse textual and waypoint signals. RGB generation enforces learning of fine-grained appearance, semantics, and object identity. Depth generation encodes spatial layout, metric geometry, and surface orientation essential for collision avoidance and trajectory feasibility. The mask token prediction objective ensures robust generative modeling, forcing the system to reconstruct masked regions, enhancing internalization of scene dynamics and complex spatial relationships.
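The masked token prediction objective described above can be sketched as a cross-entropy loss computed only at masked positions. This is a generic masked-prediction sketch, not the paper's implementation; the sequence length, vocabulary size, and masking ratio are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, SEQ = 16, 10

# Ground-truth visual token ids (e.g. quantized codes for an RGB frame).
targets = rng.integers(0, VOCAB, size=SEQ)

# Randomly mask a subset of positions; the model must reconstruct them.
mask = rng.random(SEQ) < 0.5
mask[0] = True  # guarantee at least one masked position

# Hypothetical model logits for every position.
logits = rng.standard_normal((SEQ, VOCAB))

# Softmax over the vocabulary.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Cross-entropy computed only on masked positions: unmasked tokens
# provide context for reconstruction, not loss.
masked_ce = -np.log(probs[mask, targets[mask]]).mean()
print(masked_ce > 0.0)
```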
Empirical ablation confirms that combining RGB and depth generation improves planning performance, with each modality contributing complementary information.
Exploration-Driven RL Post-Training
ExploreVLA's post-training RL phase utilizes an exploration bonus derived from world model uncertainty (entropy of token prediction distributions) conditioned on sampled trajectories. This uncertainty reflects how far a candidate trajectory is from the expert distribution, thus identifying valuable out-of-distribution candidates.
To guarantee safety, exploration bonuses are gated by PDMS scores; only trajectories exceeding a threshold PDMS receive the bonus. The composite reward R_i = PDMS_i + λ·b_i (with bonus b_i for high-uncertainty, high-safety trajectories) is then optimized via GRPO. Sampling groups of candidate trajectories and scoring them relative to each other enables robust policy improvement while balancing exploration and exploitation.
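The gated reward and group-relative scoring can be sketched as follows. The bonus weight, the PDMS threshold, and the standardized-advantage form are illustrative assumptions (GRPO implementations typically normalize rewards within each sampled group, but the paper's exact formulation may differ).

```python
import numpy as np

LAMBDA = 0.1   # hypothetical bonus weight
TAU = 0.8      # hypothetical PDMS safety threshold for gating

def entropy(p):
    """Shannon entropy of a token prediction distribution."""
    p = np.asarray(p)
    return float(-(p * np.log(p + 1e-12)).sum())

def composite_rewards(pdms, token_dists):
    # Bonus b_i: world-model uncertainty (mean token entropy), gated so
    # that only safe trajectories (PDMS_i >= TAU) are rewarded for novelty.
    b = np.array([np.mean([entropy(d) for d in dists]) for dists in token_dists])
    gated = np.where(pdms >= TAU, b, 0.0)
    return pdms + LAMBDA * gated   # R_i = PDMS_i + lambda * b_i

def grpo_advantages(rewards):
    # Group-relative scoring: standardize rewards within the sampled group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy group of 4 candidate trajectories with one token distribution each.
pdms = np.array([0.9, 0.85, 0.6, 0.95])
dists = [[[0.25] * 4],                    # high entropy, safe -> bonus
         [[0.7, 0.1, 0.1, 0.1]],
         [[0.25] * 4],                    # high entropy, unsafe -> no bonus
         [[0.97, 0.01, 0.01, 0.01]]]
R = composite_rewards(pdms, dists)
A = grpo_advantages(R)
print(R.round(3), A.round(3))
```

Note how the third trajectory, despite high world-model entropy, receives no bonus because its PDMS falls below the gate, matching the safety constraint described above.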
Experimental Results
ExploreVLA demonstrates state-of-the-art performance on NAVSIM v1 and v2 benchmarks, achieving PDMS of 93.7 and EPDMS of 88.8, respectively. Notably, it surpasses multi-sensor systems (including those using LiDAR and multiple cameras) while utilizing only a single front-view camera. It also outperforms prior dense world modeling approaches and VLA models that lack explicit exploration mechanisms.
Submetric analyses show consistently high scores on safety, comfort, collision avoidance, and lane compliance. The qualitative results reveal successful correction of safety-critical failures in challenging scenarios after the RL post-training phase, substantiating the practical effectiveness of the exploration-driven optimization.
On nuScenes, ExploreVLA yields competitive L2 errors and the lowest average collision rate among baselines under the ST-P3 protocol, confirming robust open-loop planning generalization.
Implications and Future Directions
This work advances the field by integrating dense world modeling into the supervisory pipeline and introducing image-based uncertainty as a principled exploration signal in RL post-training. This combination improves out-of-distribution generalization, safety, and planning performance.
Practically, the approach enables single-camera systems to match or outperform multimodal inputs, reducing hardware complexity and cost. Theoretically, it sets a precedent for leveraging generative uncertainty as intrinsic reward in RL for structured domains, providing a pathway for scalable, data-driven exploration in safety-critical applications.
Future extensions include expanding to multi-view camera setups for enhanced spatial coverage and diversified supervision (potentially incorporating BEV and semantic maps), and exploring closed-loop evaluation settings, which remain an open challenge. Further, integration of more advanced uncertainty quantification and exploration strategiesโsuch as diffusion-based world modeling or intention-aware generationโcould facilitate even more robust long-tail behavior.
Conclusion
ExploreVLA establishes a unified understanding-and-generation-driven paradigm for end-to-end autonomous driving, combining dense world modeling supervision with uncertainty-gated exploration rewards. With strong empirical performance and theoretical novelty, it positions world model-based RL as a critical technique for scalable, robust autonomous policy optimization, promising continued advancements as data diversity and architectural complexity grow (2604.02714).