DriveVA: Video Action Models are Zero-Shot Drivers

Published 5 Apr 2026 in cs.CV and cs.RO | (2604.04198v1)

Abstract: Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop performance of 90.9 PDM score on the challenge NAVSIM. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and 52.5% and 52.4% on the Bench2drive built on CARLA v2 compared with the state-of-the-art world-model-based planner.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper introduces DriveVA, a unified generative world model that simultaneously predicts future video and action trajectories for robust zero-shot autonomous driving.
It employs a diffusion transformer backbone and conditional flow matching to ensure tight spatiotemporal alignment between imagined scenes and planned maneuvers.
Empirical evaluations demonstrate significant reductions in L2 error and collision rates under cross-domain conditions, outperforming existing baselines in closed-loop scenarios.

Unified Video-Action World Models for Robust Zero-Shot Autonomous Driving: DriveVA

Introduction and Motivation

End-to-end generalization in autonomous driving systems across novel scenarios, diverse sensor domains, and environmental variations is a central challenge, largely due to the rarity and unpredictability of long-tail driving events and domain shift. Prevailing Vision-Language-Action (VLA) approaches leverage large-scale vision-language priors but are limited by their predominantly static-image pretraining and the lack of robust spatiotemporal modeling, resulting in suboptimal transfer to closed-loop planning and impaired zero-shot robustness. World model-based planning methods provide multi-modal future scene prediction and planning capabilities but typically employ decoupled visual imagination and trajectory generation. This architectural disconnect leads to poor alignment between generated videos and predicted actions, undermining planning reliability under distribution shifts.

In "DriveVA: Video Action Models are Zero-Shot Drivers" (2604.04198), a unified generative world model is introduced, referred to as DriveVA, which enables joint decoding of future visual and action rollouts, thereby aligning planned ego motion with generated future scenes and enhancing transfer and robustness for autonomous driving.

Model Architecture and Methodology

DriveVA leverages a DiT-based (Diffusion Transformer) backbone, initializing with spatiotemporal priors from a large-scale video generation model (Wan2.2-TI2V-5B), and integrates text and visual modalities in a causal VAE latent space to enable joint video-trajectory rollout for planning.

Figure 1: The DriveVA pipeline encodes historical observations, current velocities, and language instructions into a joint latent representation, enabling unified video and action generation via a diffusion transformer.

The model receives as conditional input a fixed-length buffer of previous egocentric video frames, language instructions, and instantaneous ego state (velocity components) (see Figure 1). History observations are encoded with the video VAE into latent tokens; the language instruction is encoded by a frozen text encoder. The DiT decoder jointly predicts the future video latent sequence and a sequence of action tokens corresponding to planned ego trajectory. This unified generative process ensures that the predicted trajectory is directly grounded in the visual scene evolution anticipated by the model, as opposed to being optimized in a sequential or loosely-coupled pipeline.

Key design aspects:

Joint Video-Action Latent Space: Actions (3D ego pose: $x, y, \text{yaw}$ ) are tokenized and temporally aligned with the predicted video clip in a shared latent space, enforcing tight semantic and physical consistency between visual imagination and trajectory.
Conditional Flow Matching: Training uses continuous flow matching to map noise to future video and action sequences, conditioned on historical context, ego state, and language.
Progressive Video Continuation: For long-horizon consistency, DriveVA rolls out future video and action sequences in a rolling window fashion, continually conditioning on freshly generated data.
Figure 2: DriveVA produces future video and trajectory rollouts in lockstep; compared to PWM, it shows superior zero-shot generalization and consistency under cross-dataset and cross-domain shifts.

Empirical Evaluation

DriveVA is benchmarked in both closed-loop (NAVSIM) and zero-shot cross-domain (nuScenes, Bench2Drive/CARLA) scenarios. Key quantitative results:

On NAVSIM, DriveVA achieves a PDMS score of 90.9, outstripping both traditional end-to-end and state-of-the-art world model baselines, despite using only front-view images as input (outperforming multimodal baselines that incorporate LiDAR and additional views).
Zero-shot transfer: When trained solely on NAVSIM and directly evaluated on nuScenes and Bench2Drive without target domain finetuning, DriveVA reduces average L2 error by 78.9% and collision rate by 83.3% (nuScenes) and by 52.5% and 52.4% (Bench2Drive) relative to PWM [zhao2025forecasting], the leading world model-planner baseline. In several cases, DriveVA surpasses in-domain finetuned baselines on nuScenes, an assertive result indicating strong intrinsic generalization from unified video-action modeling even in starkly out-of-distribution conditions.
Figure 3: In zero-shot transfer, DriveVA-predicted trajectories remain geometrically consistent with generated scene evolution, unlike PWM, which exhibits video-action mismatch (e.g., planning straight while imagining left-turn).

Figure 4: Zero-shot generalization on nuScenes—DriveVA's trajectories remain consistently aligned with anticipated visual evolution, even when evaluated under significant distributional shifts.

Figure 5: Zero-shot generalization on Bench2Drive (CARLA)—strong trajectory-video alignment is preserved, illustrating retained planning coherence under sim2real and real2sim challenges.

Figure 6: DPVO trajectory reconstructions from generated videos verify high video-action consistency in DriveVA rollouts for diverse maneuvers.

Video-Trajectory Consistency and Ablations

Central to DriveVA's performance is the enforcement of tight video-trajectory consistency throughout both training and inference. External verification with DPVO (Deep Patch Visual Odometry) demonstrates that ego trajectories, reconstructed from generated videos alone, closely match the planned trajectories produced by DriveVA, indicating that the system’s scene imagination and motion planning are geometrically and causally integrated.

Figure 7: Predicted ego trajectories strictly follow the semantic and physical cues in generated future frames, revealing coherence in long-horizon planning.

Video supervision proves essential: ablating the video loss or severing the coupling between video continuation and trajectory in rollout substantially degrades closed-loop metrics. The result underscores the necessity of dense temporal video grounding in world model-based planning and calls into question the efficacy of prior VLA models that treat video or visual context as mere auxiliary signals.

Figure 8: Failure modes illustrate that even when the predicted future diverges from ground truth (due to, e.g., prediction errors), trajectories remain consistent with imagined videos, confirming the strict coupling achieved by the unified rollout.

Comparison to Prior and Concurrent Approaches

DriveVA situates itself amidst a corpus of recent work advancing video world models, generative world simulation, and visual policy learning. Notable differences and strengths include:

Unified Decoding: Unlike cascaded or branched approaches (e.g., DriveVLA, PWM [li2025drivevlaw0, zhao2025forecasting]), which decouple video imagination from action, DriveVA’s DiT backbone directly and jointly decodes both, eliminating video-action mismatch (see Figure 3).
Efficient Closed-Loop Inference: Near-optimal planning performance is observed in as few as two flow-matching sampling steps at each policy update, combining computational efficiency and planning fidelity.
Simulation-Enhanced Scalability: Joint training with simulation data (e.g., CARLA) further improves real-world generalization, highlighting the ability of synthetic corner cases to augment transferable physical priors in large-scale world models.
Broader Implications: Experimental results confirm recent theoretical conjectures regarding the transfer benefits of spatiotemporal video priors over static vision-language foundations, particularly for scaling world model pretraining in cross-domain embodied AI [wiedemer2025video, yang2024cogvideox, brooks2024video, wan2025].

Implications and Future Directions

By empirically demonstrating that dense video-level supervision and unified video-action decoding significantly enhance closed-loop robustness and zero-shot transfer in autonomous driving, DriveVA motivates several theoretical and practical trajectories:

World Model Scaling Laws: Results indicate that generalization benefits scale with model size and video pretraining corpus richness—a subject of active investigation for autonomous driving and embodied AI [kaplan2020scaling, li2025drivevla, wan2025]. Full end-to-end fine-tuning is preferred over partial adaptation (e.g., LoRA) to maximize transfer.
Unified Planning Architectures: The demonstrated necessity of enforcing video-action consistency throughout rollout points to a paradigm shift in world model design, with consequences for both embodied simulation (e.g., robot manipulation, synthetic world forecasting) and policy learning (e.g., RL with world models, causal forecasting).
Video Models as Universal Policies: The results support the stronger assertion that sufficiently expressive generative video models, when equipped with action-grounding, may serve as universal zero-shot policies across visually rich agent domains—a direction aligned with current trends in video-predictive, multi-task, and universal world models [ye2026world, liang2025video, kim2026cosmos, brooks2024video].

Conclusion

DriveVA delivers a methodologically rigorous demonstration of unified video-action world modeling for autonomous driving, providing strong empirical evidence that tight video-trajectory coupling and dense visual supervision underpin robust zero-shot generalization. Its architectural and empirical advances indicate that joint generative modeling of video and action trajectories, scaled with large video priors and paired with efficient rollout, is a critical prerequisite for closing the real-to-sim, sim-to-real, and cross-domain generalization gaps in autonomous planning. These insights are directly relevant for the design of future end-to-end embodied agents, world model pretraining regimes, and large-scale policy deployment (2604.04198).

Markdown Report Issue