- The paper presents a two-stage pseudo-simulation method that augments real data with pre-generated synthetic views for robust AV evaluation.
- It achieves high correlation (R² = 0.8) with closed-loop simulation while requiring 6× fewer planner inferences.
- The approach combines a scoring mechanism that filters out expert-matched rule violations (mitigating label noise) with a neural rendering pipeline that produces realistic observations off the expert path.
This paper introduces "pseudo-simulation," a novel evaluation paradigm for Autonomous Vehicles (AVs) that aims to bridge the gap between scalable open-loop evaluation and comprehensive closed-loop simulation. It addresses the limitations of existing methods: real-world testing is unsafe and lacks reproducibility, closed-loop simulation is computationally expensive and can suffer from realism gaps, and open-loop evaluation overlooks compounding errors and distribution shifts.
The core idea of pseudo-simulation is to evaluate AVs on real datasets augmented with pre-generated synthetic observations. This process occurs in two stages:
- Stage 1: Initial Observations: The AV's planning model predicts a trajectory based on a real-world observation from a dataset. This trajectory is then simulated in a Bird's Eye View (BEV) environment for a fixed horizon (4 seconds).
- Stage 2: Synthetic Observations: The AV is evaluated on a set of synthetic observations. These observations are generated prior to evaluation by rendering novel views corresponding to plausible future states the AV might encounter if it deviates from the expert path. The starting points for these synthetic observations are sampled around the expert driver's endpoint from the original data.
A key aspect is how Stage 2 scores are weighted. The importance of each synthetic observation is determined by its proximity to the endpoint of the trajectory planned by the AV in Stage 1. This prioritizes the evaluation of futures that are more likely given the AV's initial plan.
Implementation Details
Stage 1: Initial Observations
- Task: Fixed-horizon (4-second) trajectory planning based on multi-view camera images, ego status (velocity, motion history), and a high-level driving command (left, straight, right).
- BEV Simulation:
- The AV's predicted trajectory is executed using a kinematic bicycle model and an LQR controller at 10 Hz.
- Unlike some open-loop setups, background traffic is reactive, using the Intelligent Driver Model (IDM) to respond to the ego agent.
- Scoring Metric (EPDMS): The Extended Predictive Driver Model Score (EPDMS) is used. It combines multiplicative penalties for rule violations and a weighted average of several subscores:
$$\mathrm{EPDMS} = \prod_{m \in M_{\mathrm{pen}}} \mathrm{filter}_m(\mathrm{agent}, \mathrm{human}) \cdot \frac{\sum_{m \in M_{\mathrm{avg}}} w_m \cdot \mathrm{filter}_m(\mathrm{agent}, \mathrm{human})}{\sum_{m \in M_{\mathrm{avg}}} w_m}$$
- Penalty terms ($M_{\mathrm{pen}}$): No at-fault Collision (NC), Drivable Area Compliance (DAC), Driving Direction Compliance (DDC), Traffic Light Compliance (TLC).
- Weighted average terms ($M_{\mathrm{avg}}$): Ego Progress (EP), Time to Collision (TTC), Lane Keeping (LK), History Comfort (HC), Extended Comfort (EC). Weights ($w_m$) are specified per subscore (e.g., EP and TTC each have a weight of 5).
Novel Filtering Mechanism: A crucial addition is a filter that prevents penalizing the AV for rule violations if the human expert driver also committed the same violation in that scene.
$$\mathrm{filter}_m(\mathrm{agent}, \mathrm{human}) = \begin{cases} 1.0 & \text{if } m(\mathrm{human}) = 0 \\ m(\mathrm{agent}) & \text{otherwise} \end{cases}$$
This handles label noise or contextually justified "violations."
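The EPDMS aggregation with the human-violation filter can be sketched in a few lines of Python. The subscore names follow the paper, and the paper specifies weights of 5 for EP and TTC; the remaining weight values here are illustrative placeholders, not the paper's exact configuration:

```python
# Illustrative subscore weights; the paper specifies EP and TTC at weight 5,
# the other values below are placeholders for the sketch.
AVG_WEIGHTS = {"EP": 5.0, "TTC": 5.0, "LK": 2.0, "HC": 1.0, "EC": 2.0}
PEN_METRICS = ["NC", "DAC", "DDC", "TLC"]

def filter_metric(agent_score, human_score):
    """Do not penalize the agent if the human expert also violated the rule
    (human subscore of 0) in the same scene."""
    return 1.0 if human_score == 0 else agent_score

def epdms(agent, human):
    """agent/human: dicts mapping metric name -> subscore in [0, 1]."""
    # Multiplicative penalty terms: any unforgiven hard violation zeroes the score.
    penalty = 1.0
    for m in PEN_METRICS:
        penalty *= filter_metric(agent[m], human[m])
    # Weighted average of the soft subscores, with the same filter applied.
    num = sum(w * filter_metric(agent[m], human[m]) for m, w in AVG_WEIGHTS.items())
    den = sum(AVG_WEIGHTS.values())
    return penalty * num / den
```

Note how the filter makes the metric robust to label noise: a drivable-area "violation" that the expert also committed contributes 1.0 instead of 0 to the penalty product.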
Stage 2: Synthetic Observations
- Pre-generation of Synthetic Observations: This is a data pre-processing step done before evaluating any specific planner.
- Start Point Sampling: Viewpoints for synthetic observations are sampled around the expert driver's observed endpoint after 4 seconds.
- Lateral sampling: every 0.5 m, up to 2.0 m on each side.
- Longitudinal sampling: every 5.0 m, spanning a physically plausible range (from the minimum stopping distance to the maximum reachable distance, assuming accelerations of ±4.0 m/s² sustained over the 4 s horizon). This yields more candidate states in high-speed scenarios (up to 20).
- Heading and History Generation: For each sampled start point, a plausible heading and motion history are generated by matching it to the nearest trajectory in a human driving dataset, with filters for velocity, acceleration, and heading differences relative to the expert.
- Rejection Sampling: Start points violating EPDMS penalty constraints (NC, DAC, DDC, TLC) are removed. Scenes with fewer than five valid synthetic observations after filtering are discarded.
- Neural Reconstruction and Rendering: A modified version of Multi-Traversal Gaussian Splatting (MTGS) is used.
- It uses only a single traversal from the dataset, expanding data usability.
- Camera poses are refined using LiDAR registration, bundle adjustment, and then optimized during MTGS training.
- Scenes with significant sensor failures (e.g., water droplets) are filtered out before reconstruction.
- A semi-automatic filtering step discards reconstructed scenes of low visual quality.
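The reachable-range computation behind the start-point sampling can be sketched as follows. This is a rough illustration under constant-acceleration kinematics; the function name and the ego-frame convention are mine, and the dense grid it returns is what the subsequent heading matching and rejection sampling would prune:

```python
import numpy as np

def sample_start_points(speed, horizon=4.0, a_max=4.0,
                        lat_step=0.5, lat_max=2.0, lon_step=5.0):
    """Grid of candidate start points (longitudinal distance, lateral offset)
    in the ego frame at the start of the horizon.

    speed: ego speed in m/s; a_max: acceleration bound in m/s^2.
    """
    # Minimum distance: full braking at -a_max (the vehicle may stop early).
    t_stop = min(speed / a_max, horizon)
    d_min = speed * t_stop - 0.5 * a_max * t_stop ** 2
    # Maximum distance: full acceleration at +a_max for the whole horizon.
    d_max = speed * horizon + 0.5 * a_max * horizon ** 2
    lons = np.arange(d_min, d_max + 1e-9, lon_step)          # every 5.0 m
    lats = np.arange(-lat_max, lat_max + 1e-9, lat_step)     # every 0.5 m
    return [(float(d), float(l)) for d in lons for l in lats]
```

The reachable span grows with speed (at 10 m/s it covers 12.5 m to 72 m), which matches the paper's observation that high-speed scenarios produce more candidate states.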
- Score Aggregation: The Stage 1 score ($s_1$) and the Stage 2 scores ($\{s_2^i\}$, one per synthetic observation $i$) are combined.
- Let $x_i$ be the start point of the $i$-th Stage 2 scenario, and $\hat{x}$ be the ego agent's endpoint from Stage 1.
- The aggregated Stage 2 score ($s_2$) is a Gaussian-weighted average of the individual synthetic observation scores:
- $s_2 = \sum_i \hat{w}_i\, s_2^i$, where $\hat{w}_i = w_i / \sum_j w_j$ and $w_i = \exp\!\big(-\|x_i - \hat{x}\|^2 / (2\sigma^2)\big)$.
- The kernel variance $\sigma^2$ is a hyperparameter (the default $\sigma^2 = 0.1$ was found to be effective).
- The final combined score is $s_{\text{combined}} = s_1 \cdot s_2$. Multiplicative aggregation was found to correlate best with closed-loop scores.
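A minimal sketch of this aggregation, assuming 2D start points and the Gaussian kernel defined above (the function and argument names are mine, not the paper's API):

```python
import numpy as np

def aggregate_scores(s1, s2_scores, start_points, endpoint, sigma_sq=0.1):
    """Combine Stage 1 and Stage 2 scores with Gaussian proximity weighting.

    s2_scores:    per-synthetic-observation scores s2_i.
    start_points: (N, 2) array of Stage 2 start points x_i.
    endpoint:     Stage 1 planned endpoint x_hat of the ego agent.
    """
    start_points = np.asarray(start_points, dtype=float)
    endpoint = np.asarray(endpoint, dtype=float)
    d_sq = np.sum((start_points - endpoint) ** 2, axis=1)
    w = np.exp(-d_sq / (2.0 * sigma_sq))    # unnormalized Gaussian kernel w_i
    w_hat = w / w.sum()                     # normalized weights w_hat_i
    s2 = float(np.dot(w_hat, s2_scores))    # weighted Stage 2 score
    return s1 * s2                          # multiplicative combination
```

With the small default variance, synthetic observations far from the Stage 1 endpoint contribute almost nothing, so the evaluation concentrates on futures consistent with the AV's own plan.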
Experimental Results and Practical Implications
- High Correlation with Closed-Loop Simulation:
- Pseudo-simulation achieves a strong correlation (R² = 0.8) with the nuPlan closed-loop simulator across 83 diverse planners, better than the best open-loop approach (R² = 0.7).
- This suggests pseudo-simulation can serve as a more reliable proxy for computationally expensive closed-loop testing.
- Efficiency:
- Pseudo-simulation requires significantly fewer planner inferences (13 per scenario on average: 1 for Stage 1, ~12 for Stage 2) compared to closed-loop simulation in nuPlan (80 inferences for an 8-second rollout at 10 Hz). This is a 6× reduction.
- This efficiency allows for faster iteration cycles in AV development.
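The efficiency figures follow directly from the inference counts above (a simple recomputation, not quoted from the paper):

```latex
\underbrace{8\,\mathrm{s} \times 10\,\mathrm{Hz} = 80}_{\text{closed-loop planner calls}}
\qquad
\underbrace{1 + 12 = 13}_{\text{pseudo-simulation calls}}
\qquad
\frac{80}{13} \approx 6.2\times \ \text{reduction}
```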
- Robustness Evaluation:
- The two-stage approach effectively tests for error recovery and sensitivity to perturbations. Performance generally drops from Stage 1 (real data) to Stage 2 (synthetic, perturbed data), highlighting model brittleness.
- The weighting scheme (σ² = 0.1) for Stage 2 scores, based on proximity to the Stage 1 planned endpoint, is critical for achieving high correlation.
- Even with reduced synthetic view density (e.g., 25% of views, ~3 Stage 2 observations per scene), correlation remains high (above r = 0.85).
- NAVSIM v2 Leaderboard:
- A public leaderboard ("navhard") based on pseudo-simulation is established using a challenging subset of nuPlan (450 Stage 1, 5462 Stage 2 observations).
- Example scenes show challenging unprotected turns and dense traffic.
- Evaluation of existing planners (Constant Velocity, MLP, Latent TransFuser (LTF), PDM-Closed) on navhard revealed specific failure modes previously overlooked. For example, PDM-Closed, while strong in safety metrics, performed poorly on comfort metrics (HC, EC).
Example Leaderboard Insights (Table `tab:sota_transposed` in paper):
| Metric            | Stage | LTF (Vision) | PDM-C (Privileged) |
|-------------------|-------|--------------|--------------------|
| NC (Collisions)   | S1    | 96.2         | 94.4               |
| NC (Collisions)   | S2    | 77.7         | 88.1               |
| LK (Lane Keeping) | S1    | 94.2         | 99.3               |
| LK (Lane Keeping) | S2    | 45.4         | 73.7               |
| EPDMS (Total)     | –     | 23.1         | 51.3               |
- This shows LTF's performance degrades more significantly in Stage 2 than PDM-C's, particularly in metrics like Lane Keeping.
- Neural Rendering Fidelity:
- The modified MTGS rendering pipeline provides sufficient visual fidelity for downstream tasks.
- When an LTF model (trained only on real data) was evaluated:
- On real Stage 1 data: Perception mIoU = 46.0, Planning EPDMS = 62.3.
- On synthetic Stage 1 data (same pose as real): mIoU = 37.6, EPDMS = 61.0. (Small drop in planning despite mIoU drop).
- On synthetic Stage 2 data (perturbed poses): mIoU = 36.9, EPDMS = 44.2. (Significant drop in planning, attributed to planner sensitivity to distribution shift rather than just rendering artifacts).
- Novel view synthesis quality (LPIPS) for the renderer is 0.253, outperforming a baseline (Street Gaussians at 0.354) and an ablation without pose optimization (0.322).
Deployment and Computational Considerations
- Pre-processing Cost: Generating synthetic views using the MTGS-based scene optimization takes approximately 1-2 hours per scene on current hardware. While manageable for datasets under 1000 scenes, this is a limitation for extremely large datasets.
- Framework: The code is available at https://github.com/autonomousvision/navsim, released as NAVSIM v2. This allows practitioners to adopt and benchmark with pseudo-simulation.
Limitations and Future Work
- Real-World Correlation: Current validation is against established simulators, not directly with real-world deployment metrics.
- Scalability of Pre-processing: The per-scene optimization for rendering is a bottleneck. Future work could explore faster, feedforward 3D scene representation methods.
- Rendering Artifacts: While metrics are good, some visual artifacts may persist. Human perceptual studies could be beneficial.
- Background Traffic Realism: Current Stage 2 traffic uses simple rule-based models. Incorporating more sophisticated, learned traffic models could improve fidelity.
Pseudo-simulation offers a practical and more efficient way to evaluate AVs, capturing aspects of error recovery and robustness typically requiring expensive closed-loop simulations. Its ability to reveal failure modes and the availability of a public benchmark make it a valuable tool for AV development.