ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving

Published 9 Oct 2025 in cs.CV and cs.RO | (2510.08562v1)

Abstract: End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of causal inference, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task to predict the residual deviation from a deterministic inertial reference. The inertial reference serves as a counterfactual, forcing the model to move beyond simple pattern recognition and instead identify the underlying causal factors (e.g., traffic rules, obstacles) that necessitate deviations from a default, inertially-guided path. To deal with the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. It re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. Extensive experiments validate the effectiveness of our framework. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy with only two denoising steps, demonstrating that our approach significantly simplifies the learning task and improves model performance. The code will be released to facilitate further research.


Summary

The paper introduces ResAD, a framework that decomposes trajectory prediction into an inertial reference and residual deviations to mitigate causal confusion and planning horizon issues.
It employs point-wise residual normalization and diffusion-based planning, achieving state-of-the-art results on NAVSIM benchmarks with improved PDMS and safety metrics.
The approach integrates multi-modal sensor fusion and a trajectory ranker, enhancing contextual planning and offering generalizable benefits for end-to-end autonomous driving.

Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving

Motivation and Problem Formulation

End-to-end autonomous driving (E2EAD) systems have shifted the paradigm from modular pipelines to unified models that directly map sensor data to future trajectories. However, direct trajectory prediction is fundamentally challenged by spatio-temporal non-uniformity in trajectory data, leading to two critical issues: causal confusion and the planning horizon dilemma. Causal confusion arises when models learn spurious correlations rather than true causal relationships, while the planning horizon dilemma refers to the dominance of uncertain, long-horizon errors in the optimization process, which can compromise immediate safety.

The ResAD framework introduces Normalized Residual Trajectory Modeling to address these issues. Instead of predicting the trajectory outright, ResAD decomposes the task into predicting the residual deviation from a deterministic inertial reference, which is extrapolated from the ego-vehicle’s current state. This inertial reference acts as a counterfactual baseline, compelling the model to focus on the causal factors necessitating deviation (e.g., obstacles, traffic rules) rather than mere statistical patterns.

Figure 1: (a) Raw trajectory distributions show mean drift and increasing variance; residual modeling centers the distribution, and point-wise normalization further stabilizes variance. (b) ResAD uses an inertial reference as a baseline, forcing the model to learn causal deviations rather than statistical correlations.

Architecture and Implementation

ResAD employs a multi-modal sensor fusion backbone, integrating multi-view camera images and LiDAR data via a feature interaction encoder. The inertial reference is generated from the ego-vehicle’s state and perturbed into a cluster to ensure robustness and enable multi-modal predictions. Diffusion decoders, conditioned on this reference cluster, merge the encoded features through cross-attention to output planned trajectories.

Figure 2: The ResAD framework fuses multi-view images and LiDAR, generates a perturbed inertial reference cluster, and uses diffusion decoders with cross-attention for trajectory prediction.

Trajectory Residual Modeling

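Given the ego-vehicle’s current position $\mathbf{p}_0$ and velocity $\mathbf{v}_0$, the inertial reference trajectory is computed by constant-velocity extrapolation, where $\Delta t_i$ is the time offset of the $i$-th future waypoint: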

$\mathbf{p}_{t_i} = \mathbf{p}_0 + \mathbf{v}_0 \cdot \Delta t_i$

The residual is defined as the point-wise difference between the ground-truth trajectory $\tau_{\mathrm{gt}}$ and the inertial reference $\tau_{\mathrm{ref}} = \{\mathbf{p}_{t_i}\}$:

$\boldsymbol{r} = \tau_{\mathrm{gt}} - \tau_{\mathrm{ref}}$

The model’s objective is to predict this residual, focusing learning capacity on the causal elements of driving.
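As a minimal sketch of the two formulas above, the following NumPy snippet builds a constant-velocity reference and subtracts it from a ground-truth trajectory. The function names, the 2D planar state, the 0.5 s waypoint spacing, and the example values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def inertial_reference(p0, v0, dt):
    """Constant-velocity extrapolation: p_{t_i} = p0 + v0 * dt_i."""
    p0 = np.asarray(p0, dtype=float)   # (2,) current position
    v0 = np.asarray(v0, dtype=float)   # (2,) current velocity
    dt = np.asarray(dt, dtype=float)   # (T,) time offsets of future waypoints
    return p0 + dt[:, None] * v0       # (T, 2) reference waypoints

def residual(tau_gt, tau_ref):
    """Point-wise residual r = tau_gt - tau_ref."""
    return np.asarray(tau_gt, dtype=float) - tau_ref

# Hypothetical example: ego moving at 4 m/s along x, then slowing and drifting left.
dt = np.array([0.5, 1.0, 1.5, 2.0])
tau_ref = inertial_reference([0.0, 0.0], [4.0, 0.0], dt)
tau_gt = np.array([[2.0, 0.0], [4.0, 0.1], [5.8, 0.4], [7.4, 0.9]])
r = residual(tau_gt, tau_ref)
print(r)
```

In this toy case the residual is zero at the first waypoint and grows with the horizon, which is the kind of horizon-dependent magnitude the normalization step is meant to handle.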

Point-wise Residual Normalization (PRNorm)

To prevent optimization from being dominated by large-magnitude errors at distant waypoints, PRNorm applies min-max scaling to each residual component:

$\tilde{r}_t^d = 2\gamma \left( \frac{r_t^d - r_{\min}^d}{r_{\max}^d - r_{\min}^d + \epsilon_0} \right) - \gamma$

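Here $r_{\min}^d$ and $r_{\max}^d$ are the minimum and maximum values of the corresponding residual component, $\gamma$ sets the output range to approximately $[-\gamma, \gamma]$, and $\epsilon_0$ is a small constant that guards against division by zero. Because every waypoint is rescaled to the same range, learning is balanced across all trajectory points, which accelerates convergence and improves near-term, safety-critical adjustments.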
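The scaling can be sketched in NumPy as below. This is an illustration under stated assumptions: the minimum and maximum are taken per waypoint and per coordinate over a batch of residuals, and the names `prnorm_fit`, `prnorm`, `prnorm_inverse`, the default `gamma=1.0`, and the synthetic data are all hypothetical rather than taken from the paper.

```python
import numpy as np

def prnorm_fit(residuals):
    """residuals: (N, T, D). Returns min and max per waypoint t and dimension d.
    Assumption: statistics are computed over the N samples."""
    return residuals.min(axis=0), residuals.max(axis=0)

def prnorm(r, r_min, r_max, gamma=1.0, eps=1e-6):
    """Min-max scale each residual component to roughly [-gamma, gamma]."""
    return 2.0 * gamma * (r - r_min) / (r_max - r_min + eps) - gamma

def prnorm_inverse(r_tilde, r_min, r_max, gamma=1.0, eps=1e-6):
    """Map a normalized residual back to metric units."""
    return (r_tilde + gamma) / (2.0 * gamma) * (r_max - r_min + eps) + r_min

# Synthetic residuals whose spread grows with the horizon (0.1 m, 1 m, 5 m).
rng = np.random.default_rng(0)
scale = np.array([0.1, 1.0, 5.0])[None, :, None]
residuals = rng.normal(size=(256, 3, 2)) * scale

r_min, r_max = prnorm_fit(residuals)
normed = prnorm(residuals, r_min, r_max)
restored = prnorm_inverse(normed, r_min, r_max)
```

After scaling, the near and far waypoints occupy the same numeric range, so a squared-error loss on `normed` no longer weights the distant waypoints more heavily simply because their raw magnitudes are larger. The inverse map is what a planner would apply to a predicted residual before adding it back to the inertial reference.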

Figure 3: PRNorm accelerates training convergence and improves mean PDMS compared to vanilla normalization.

Inertia Reference Perturbation

ResAD generates multi-modal trajectories by perturbing the initial velocity vector with Gaussian noise, producing a cluster of inertial references. Each reference is propagated through the constant velocity model, and the model predicts residuals for each, yielding diverse, context-relevant trajectories.
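A rough sketch of this perturbation step follows, assuming isotropic Gaussian noise on a 2D velocity. The cluster size `k`, the noise scale `sigma`, and the function name are placeholders chosen for illustration; the source does not give these values.

```python
import numpy as np

def perturbed_reference_cluster(p0, v0, dt, k=8, sigma=0.5, seed=0):
    """Return K constant-velocity references from noisy copies of v0.
    Assumption: isotropic Gaussian noise with standard deviation sigma (m/s)."""
    rng = np.random.default_rng(seed)
    p0 = np.asarray(p0, dtype=float)                          # (2,)
    v0 = np.asarray(v0, dtype=float)                          # (2,)
    dt = np.asarray(dt, dtype=float)                          # (T,)
    v = v0 + rng.normal(scale=sigma, size=(k, v0.shape[0]))   # (K, 2) perturbed velocities
    return p0 + dt[None, :, None] * v[:, None, :]             # (K, T, 2) reference cluster

dt = np.array([0.5, 1.0, 1.5, 2.0])
cluster = perturbed_reference_cluster([0.0, 0.0], [4.0, 0.0], dt)
# With zero noise, every reference collapses to the unperturbed one.
unperturbed = perturbed_reference_cluster([0.0, 0.0], [4.0, 0.0], dt, sigma=0.0)
```

Each of the K references would then receive its own predicted residual, so the spread of candidate trajectories follows the current ego state rather than a fixed set of anchors.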

Diffusion-Based Planning

ResAD leverages a vanilla diffusion model, where the denoising process is conditioned on the encoded features and the perturbed inertial references. During inference, only two denoising steps are required, which keeps sampling cost low.

Multimodal Trajectory Ranking

Inspired by VADv2 and Hydra-MDP, ResAD incorporates a trajectory ranker. Candidate trajectories are scored using a Transformer that interacts with perception features and ego status embeddings. The ranker is trained to distill knowledge from rule-based planners and ground-truth waypoints, selecting the highest-scoring trajectory for execution.

Experimental Results

ResAD achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks. On NAVSIM v1, ResAD attains a PDMS of 88.6, outperforming prior methods in key metrics such as Drivable Area Compliance (DAC) and Ego Progress (EP). On NAVSIM v2, ResAD achieves an EPDMS of 85.5, surpassing DiffusionDrive and demonstrating superior performance in extended metrics, including lane keeping and comfort.

Ablation studies confirm the effectiveness of each component. Trajectory Residual Modeling and PRNorm yield significant improvements in DAC and EP, while Inertia Reference Perturbation enhances multi-modal planning and overall PDMS. The framework generalizes well, improving performance when integrated into other planning models such as Transfuser and Transfuser$_\mathrm{DP}$.

Figure 4: ResAD dynamically generates context-aware trajectories by perturbing the ego-vehicle’s velocity, avoiding the infeasible options produced by static vocabulary-based methods like DiffusionDrive.

Practical and Theoretical Implications

ResAD’s decomposition of trajectory prediction into inertial reference and residuals introduces a physically grounded prior, simplifying the learning task and improving interpretability. The use of PRNorm addresses the optimization imbalance inherent in trajectory prediction, ensuring that safety-critical near-term corrections are prioritized. The multi-modal planning strategy, based on reference perturbation, enables efficient and context-aware trajectory generation without reliance on static vocabularies.

From a practical perspective, ResAD’s architecture is compatible with existing sensor fusion backbones and can be trained efficiently using diffusion models with minimal denoising steps. The framework’s generalizability is validated by its performance gains when applied to heterogeneous planning networks.

Theoretically, the shift from direct trajectory prediction to residual modeling anchored by physical priors may inspire future research in causal inference for autonomous systems. The explicit separation of baseline and deviation could facilitate more robust reasoning about intent and environmental factors.

Conclusion

ResAD reframes end-to-end autonomous driving by introducing Normalized Residual Trajectory Modeling, anchored by a deterministic inertial reference and enhanced by point-wise normalization and multi-modal perturbation. This approach mitigates causal confusion and the planning horizon dilemma, yielding state-of-the-art performance on NAVSIM benchmarks. The framework’s modularity and generalizability position it as a robust foundation for future E2EAD research, with implications for both practical deployment and theoretical advancement in causal modeling for autonomous systems.