- The paper introduces ResAD, a framework that decomposes trajectory prediction into an inertial reference and residual deviations to mitigate causal confusion and the planning horizon dilemma.
- It combines point-wise residual normalization (PRNorm) with diffusion-based planning and reports state-of-the-art results on the NAVSIM benchmarks: a PDMS of 88.6 on v1 and an EPDMS of 85.5 on v2.
- The approach integrates multi-modal sensor fusion and a trajectory ranker, and its components also improve other planning models such as Transfuser and TransfuserDP.
Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
End-to-end autonomous driving (E2EAD) systems have shifted the paradigm from modular pipelines to unified models that directly map sensor data to future trajectories. However, direct trajectory prediction is fundamentally challenged by spatio-temporal non-uniformity in trajectory data, leading to two critical issues: causal confusion and the planning horizon dilemma. Causal confusion arises when models learn spurious correlations rather than true causal relationships, while the planning horizon dilemma refers to the dominance of uncertain, long-horizon errors in the optimization process, which can compromise immediate safety.
The ResAD framework introduces Normalized Residual Trajectory Modeling to address these issues. Instead of predicting the trajectory outright, ResAD decomposes the task into predicting the residual deviation from a deterministic inertial reference, which is extrapolated from the ego-vehicle’s current state. This inertial reference acts as a counterfactual baseline, compelling the model to focus on the causal factors necessitating deviation (e.g., obstacles, traffic rules) rather than mere statistical patterns.
Figure 1: (a) Raw trajectory distributions show mean drift and increasing variance; residual modeling centers the distribution, and point-wise normalization further stabilizes variance. (b) ResAD uses an inertial reference as a baseline, forcing the model to learn causal deviations rather than statistical correlations.
Architecture and Implementation
ResAD employs a multi-modal sensor fusion backbone, integrating multi-view camera images and LiDAR data via a feature interaction encoder. The inertial reference is generated from the ego-vehicle’s state and perturbed into a cluster to ensure robustness and enable multi-modal predictions. Diffusion decoders, conditioned on this reference cluster, merge the encoded features through cross-attention to output planned trajectories.
Figure 2: The ResAD framework fuses multi-view images and LiDAR, generates a perturbed inertial reference cluster, and uses diffusion decoders with cross-attention for trajectory prediction.
Trajectory Residual Modeling
Given the ego-vehicle’s position $p_0$ and velocity $v_0$, the inertial reference trajectory is computed as:
$$p_{t_i} = p_0 + v_0 \cdot \Delta t_i$$
The residual is defined as the point-wise difference between the ground-truth trajectory and the inertial reference:
$$r = \tau_{gt} - \tau_{ref}$$
The model’s objective is to predict this residual, focusing learning capacity on the causal elements of driving.
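The two definitions above can be illustrated with a small NumPy sketch. It is not the authors' code: the function names, the 2D waypoint layout, the 0.5 s time offsets, and the toy ground-truth trajectory are all assumptions chosen for illustration.

```python
import numpy as np

def inertial_reference(p0, v0, dt):
    """Constant-velocity extrapolation: one waypoint p0 + v0 * dt_i per offset."""
    p0 = np.asarray(p0, dtype=float)   # (2,) current position
    v0 = np.asarray(v0, dtype=float)   # (2,) current velocity
    dt = np.asarray(dt, dtype=float)   # (T,) time offsets in seconds
    return p0[None, :] + dt[:, None] * v0[None, :]   # (T, 2)

def residual(traj_gt, traj_ref):
    """Point-wise difference between ground truth and inertial reference."""
    return np.asarray(traj_gt, dtype=float) - traj_ref

# Toy example (hypothetical numbers): the ego vehicle moves at 4 m/s along x,
# while the ground-truth trajectory slows down and drifts to the left.
dt = np.array([0.5, 1.0, 1.5, 2.0])
ref = inertial_reference([0.0, 0.0], [4.0, 0.0], dt)
gt = np.array([[2.0, 0.1], [4.0, 0.4], [5.8, 0.9], [7.5, 1.6]])
res = residual(gt, ref)
print(res)
```

In this toy case the residual is zero wherever the vehicle simply keeps its current motion, so the learning target only carries the braking and lateral drift.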
Point-wise Residual Normalization (PRNorm)
To prevent optimization from being dominated by large-magnitude errors at distant waypoints, PRNorm applies min-max scaling to each residual component:
$$\tilde{r}_t^d = 2\gamma \left( \frac{r_t^d - r_{\min}^d}{r_{\max}^d - r_{\min}^d + \epsilon_0} \right) - \gamma$$
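Because every residual component is mapped into roughly the same range [−γ, γ], distant waypoints no longer dominate the loss. This balances learning across all trajectory points, accelerates convergence, and improves near-term, safety-critical adjustments.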
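A minimal sketch of this scaling and its inverse follows. It rests on assumptions not stated in the text: the minimum and maximum are taken over a dataset separately for each waypoint index and coordinate (one reading of "point-wise"), the residuals are synthetic Gaussian samples whose spread grows with the horizon, and the names `prnorm`, `prnorm_inverse`, `gamma`, and `eps` are hypothetical.

```python
import numpy as np

def prnorm(r, r_min, r_max, gamma=1.0, eps=1e-6):
    """Min-max scale residuals into approximately [-gamma, gamma]."""
    return 2.0 * gamma * (r - r_min) / (r_max - r_min + eps) - gamma

def prnorm_inverse(r_tilde, r_min, r_max, gamma=1.0, eps=1e-6):
    """Map normalized residuals back to metric units."""
    return (r_tilde + gamma) / (2.0 * gamma) * (r_max - r_min + eps) + r_min

# Synthetic residuals (assumption): the spread grows with the waypoint index,
# mimicking the increasing variance of long-horizon deviations.
rng = np.random.default_rng(0)
num_samples, horizon, dims = 256, 8, 2
spread = np.linspace(0.5, 8.0, horizon)[None, :, None]
residuals = rng.normal(size=(num_samples, horizon, dims)) * spread

# Statistics per waypoint and coordinate, shape (horizon, dims).
r_min = residuals.min(axis=0)
r_max = residuals.max(axis=0)

normalized = prnorm(residuals, r_min, r_max)
restored = prnorm_inverse(normalized, r_min, r_max)

# Per-waypoint standard deviation of the x-coordinate before and after scaling.
raw_std = residuals.std(axis=0)[:, 0]
norm_std = normalized.std(axis=0)[:, 0]
print(raw_std.round(2), norm_std.round(2))
```

With per-waypoint statistics the normalized spread is similar at every horizon, whereas the raw spread grows by more than an order of magnitude in this toy setup. The inverse is what would turn a predicted normalized residual back into meters before it is added to the reference.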
Figure 3: PRNorm accelerates training convergence and improves mean PDMS compared to vanilla normalization.
Inertia Reference Perturbation
ResAD generates multi-modal trajectories by perturbing the initial velocity vector with Gaussian noise, producing a cluster of inertial references. Each reference is propagated through the constant velocity model, and the model predicts residuals for each, yielding diverse, context-relevant trajectories.
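The perturbation step can be sketched as below, under stated assumptions: isotropic Gaussian noise on a 2D velocity, and hypothetical parameters `num_refs`, `sigma`, and `seed` whose values are not taken from the paper.

```python
import numpy as np

def perturbed_reference_cluster(p0, v0, dt, num_refs=8, sigma=0.5, seed=0):
    """Build a cluster of constant-velocity references from noisy velocities."""
    rng = np.random.default_rng(seed)
    p0 = np.asarray(p0, dtype=float)   # (2,) current position
    v0 = np.asarray(v0, dtype=float)   # (2,) current velocity
    dt = np.asarray(dt, dtype=float)   # (T,) time offsets in seconds
    # One Gaussian velocity perturbation per reference, shape (K, 2).
    noise = rng.normal(loc=0.0, scale=sigma, size=(num_refs, v0.shape[0]))
    velocities = v0[None, :] + noise
    # Propagate every perturbed velocity through the constant-velocity model.
    return p0[None, None, :] + dt[None, :, None] * velocities[:, None, :]  # (K, T, 2)

dt = np.array([0.5, 1.0, 1.5, 2.0])
cluster = perturbed_reference_cluster([0.0, 0.0], [4.0, 0.0], dt, num_refs=5)
unperturbed = perturbed_reference_cluster([0.0, 0.0], [4.0, 0.0], dt,
                                          num_refs=5, sigma=0.0)
# Distance of each perturbed waypoint from the unperturbed reference.
deviation = np.linalg.norm(cluster - unperturbed, axis=-1)   # (K, T)
print(cluster.shape, deviation.round(2))
```

Because the noise enters through the velocity, each reference fans out linearly with time from the same starting point, so the cluster stays tight near the vehicle and spreads at longer horizons.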
Diffusion-Based Planning
ResAD leverages a vanilla diffusion model, where the denoising process is conditioned on the encoded features and the perturbed inertial references. During inference, only two denoising steps are required, which keeps the computational cost of planning low.
Multimodal Trajectory Ranking
Inspired by VADv2 and Hydra-MDP, ResAD incorporates a trajectory ranker. Candidate trajectories are scored using a Transformer that interacts with perception features and ego status embeddings. The ranker is trained to distill knowledge from rule-based planners and ground-truth waypoints, selecting the highest-scoring trajectory for execution.
Experimental Results
ResAD achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks. On NAVSIM v1, ResAD attains a PDMS of 88.6, outperforming prior methods in key metrics such as Drivable Area Compliance (DAC) and Ego Progress (EP). On NAVSIM v2, ResAD achieves an EPDMS of 85.5, surpassing DiffusionDrive and demonstrating superior performance in extended metrics, including lane keeping and comfort.
Ablation studies confirm the effectiveness of each component. Trajectory Residual Modeling and PRNorm yield significant improvements in DAC and EP, while Inertia Reference Perturbation enhances multi-modal planning and overall PDMS. The framework generalizes well, improving performance when integrated into other planning models such as Transfuser and TransfuserDP.
Figure 4: ResAD dynamically generates context-aware trajectories by perturbing the ego-vehicle’s velocity, avoiding the infeasible options produced by static vocabulary-based methods like DiffusionDrive.
Practical and Theoretical Implications
ResAD’s decomposition of trajectory prediction into inertial reference and residuals introduces a physically grounded prior, simplifying the learning task and improving interpretability. The use of PRNorm addresses the optimization imbalance inherent in trajectory prediction, ensuring that safety-critical near-term corrections are prioritized. The multi-modal planning strategy, based on reference perturbation, enables efficient and context-aware trajectory generation without reliance on static vocabularies.
From a practical perspective, ResAD’s architecture is compatible with existing sensor fusion backbones, and its diffusion decoder needs only two denoising steps at inference. The framework’s generalizability is supported by the performance gains observed when its components are integrated into other planning networks such as Transfuser and TransfuserDP.
Theoretically, the shift from direct trajectory prediction to residual modeling anchored by physical priors may inspire future research in causal inference for autonomous systems. The explicit separation of baseline and deviation could facilitate more robust reasoning about intent and environmental factors.
Conclusion
ResAD reframes end-to-end autonomous driving by introducing Normalized Residual Trajectory Modeling, anchored by a deterministic inertial reference and enhanced by point-wise normalization and multi-modal perturbation. This approach mitigates causal confusion and the planning horizon dilemma, yielding state-of-the-art performance on NAVSIM benchmarks. The framework’s modularity and generalizability position it as a robust foundation for future E2EAD research, with implications for both practical deployment and theoretical advancement in causal modeling for autonomous systems.