TB-DiT: BEV Diffusion for Trajectory Planning
- The paper introduces TB-DiT, a diffusion-based transformer integrating BEV heatmaps and ego-state embeddings for annotation-free trajectory planning.
- It employs a dual-stage architecture with sensor fusion and specialized cross-attention, refining predictions using a Gaussian heatmap as a proxy supervisory signal.
- Experimental evaluation on NAVSIM shows that TB-DiT improves performance by at least 2 PDMS points over previous annotation-free methods, rivaling perception-supervised models.
The Trajectory-oriented BEV Diffusion Transformer (TB-DiT) is a generative planning module integrated into the annotation-free, end-to-end autonomous driving framework TrajDiff. TB-DiT synthesizes multimodal, plausible future ego-vehicle trajectories from raw sensor observations, fusing environment context from self-supervised Bird's-Eye-View (BEV) features and predictive embeddings of ego-state in a diffusion-based transformer architecture. Unlike conventional frameworks, TB-DiT eschews explicit perception supervision and handcrafted motion anchors, employing a Gaussian BEV heatmap as a proxy supervisory signal for trajectory compliance with road context and navigation intent. Evaluation on NAVSIM demonstrates that TB-DiT achieves state-of-the-art performance among annotation-free methods, rivaling perception-annotated planners when trained at scale (Gui et al., 30 Nov 2025).
1. Architectural Composition and Input Encoding
TB-DiT operates atop a dual-stage encoder paradigm: the trajectory-oriented BEV encoder and the diffusion transformer core.
Trajectory-oriented BEV Encoder:
- Sensor fusion aggregates camera and LiDAR features via a Transfuser-style backbone, yielding a fused BEV feature map.
- Ego-state signals (velocity, acceleration, high-level command) are embedded through an MLP into an ego-state token.
- A learned query bank undergoes transformer-based contextualization against the BEV feature and the ego-state embedding.
- A lightweight decoder maps the contextualized queries to a BEV heatmap prediction.
- The BEV feature and predicted heatmap are fused to form the TrajBEV feature used for subsequent conditioning.
Summary Table: TB-DiT Input/Feature Flow
| Stage | Input(s) | Output(s) |
|---|---|---|
| Sensor Fusion Backbone | Camera, LiDAR | Fused BEV feature map |
| Ego-state Encoder | Velocity, Acceleration, Command | Ego-state embedding |
| Transformer Query Interaction | Query bank, BEV feature, ego embedding | Contextualized queries |
| Heatmap Decoder | Contextualized queries | BEV heatmap prediction |
| BEV/Heatmap Fusion | BEV feature, predicted heatmap | TrajBEV feature |
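The feature flow above can be sketched in a minimal, dependency-light form. The sketch below uses numpy with single-head attention and hypothetical shapes and fusion (element-wise modulation of the BEV feature by the heatmap); the paper's actual dimensions, projection layers, and fusion operator are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # Single-head scaled dot-product attention (learned projections omitted).
    attn = softmax(q @ kv.T / np.sqrt(d))
    return attn @ kv

# Hypothetical shapes; not taken from the paper.
H, W, d = 16, 16, 32                      # BEV grid and channel width
bev = rng.normal(size=(H * W, d))         # fused camera+LiDAR BEV feature
ego_raw = rng.normal(size=7)              # velocity, acceleration, command
W_ego = rng.normal(size=(7, d)) * 0.1
ego = np.tanh(ego_raw @ W_ego)            # one-layer stand-in for the ego MLP

queries = rng.normal(size=(64, d))        # learned query bank
ctx = np.concatenate([bev, ego[None]], axis=0)
queries = queries + cross_attention(queries, ctx, d)  # transformer contextualization

W_dec = rng.normal(size=(d, H * W // 64)) * 0.1
heatmap = (queries @ W_dec).reshape(H, W)             # lightweight decoder -> heatmap
trajbev = bev.reshape(H, W, d) * heatmap[..., None]   # one plausible BEV/heatmap fusion

print(trajbev.shape)   # (16, 16, 32)
```

The essential point is the data flow: queries attend jointly to BEV and ego tokens, decode to a heatmap, and the heatmap re-weights the BEV feature into TrajBEV.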
2. Diffusion Transformer Core
At each step of the denoising process, TB-DiT operates over noisy trajectory states (positions and heading), conditioned on the TrajBEV context, an ego-state query, and a timestep embedding. Its structure incorporates specialized attention mechanisms to capture temporal, spatial, and ego-centric dependencies:
- Ego-BEV Interaction: a cross-attention module updates the ego-state query with the TrajBEV feature, yielding the composite conditioning signal.
- Trajectory Encoder: the noisy trajectory is projected into a latent token sequence via a small encoder.
- Core TB-DiT Block (repeated a fixed number of times):
  - Temporal self-attention over the trajectory latents captures dependencies across future timesteps.
  - BEV cross-attention: the TrajBEV feature is compressed by a Q-former into a compact token set, which the trajectory latents attend to, fusing spatial context into the trajectory representation.
  - MLP decoding: a feed-forward head maps the updated latents to the predicted noise.
- Reverse Diffusion: using the predicted noise $\epsilon_\theta$, the next state follows the standard denoising update:
$$\tau_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\tau_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\right) + \sigma_t z,\qquad z \sim \mathcal{N}(0, I).$$
- Deterministic DDIM sampling can optionally omit the noise term for efficient rollout generation.
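One denoising iteration of the core block can be illustrated compactly. The numpy sketch below (single-head attention, no layer norms or learned projections, hypothetical sizes) shows temporal self-attention, Q-former compression of TrajBEV followed by cross-attention, an MLP noise head, and a deterministic reverse update with the stochastic term dropped; none of the dimensions are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

T_horizon, d = 8, 32                      # 8 future waypoints (hypothetical)
z = rng.normal(size=(T_horizon, d))       # encoded noisy trajectory latents
trajbev = rng.normal(size=(256, d))       # flattened TrajBEV tokens
qf_queries = rng.normal(size=(16, d))     # Q-former query bank (hypothetical size)

# Core block, repeated N times in the full model (one iteration shown).
z = z + attention(z, z, z)                             # temporal self-attention
bev_tokens = attention(qf_queries, trajbev, trajbev)   # Q-former compression
z = z + attention(z, bev_tokens, bev_tokens)           # BEV cross-attention

W_out = rng.normal(size=(d, 3)) * 0.1
eps_pred = z @ W_out                      # MLP head -> noise over (x, y, heading)

# One reverse-diffusion mean update (noise term omitted, toy schedule values).
alpha_t, alpha_bar_t = 0.98, 0.5
tau_t = rng.normal(size=(T_horizon, 3))
tau_prev = (tau_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
print(tau_prev.shape)   # (8, 3)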
3. Mathematical Formulation and Supervisory Signals
TB-DiT employs a discrete-time diffusion model directly over continuous future trajectories. The forward process iteratively adds Gaussian noise to the expert trajectory $\tau_0$:
$$q(\tau_t \mid \tau_{t-1}) = \mathcal{N}\!\left(\tau_t;\ \sqrt{1-\beta_t}\,\tau_{t-1},\ \beta_t I\right),\qquad \tau_t = \sqrt{\bar{\alpha}_t}\,\tau_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).$$
The denoising (reverse) process estimates the score function via the transformer, using the noise-prediction parameterization for each conditioned trajectory state.
Gaussian BEV Heatmap Target:
The heatmap supervision eschews annotated perception maps; instead, it encodes the stochastic reachable set of the agent as a per-cell maximum of velocity-scaled Gaussians centered on the expert waypoints:
$$H(\mathbf{x}) = \max_{t}\ \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{p}_t \rVert^2}{2\,\sigma(\mathbf{v}_t)^2}\right),$$
where $\mathbf{p}_t$ and $\mathbf{v}_t$ are the position and velocity at step $t$, and the radius $\sigma(\cdot)$ grows with speed. This aligns supervision to plausible drivable space induced solely by expert demonstrations.
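Constructing this target is straightforward. The sketch below assumes a hypothetical linear speed-to-radius mapping and toy grid parameters (`sigma_base`, `sigma_vel`, `extent` are illustrative names, not from the paper):

```python
import numpy as np

def gaussian_heatmap(waypoints, speeds, grid=64, extent=32.0,
                     sigma_base=1.0, sigma_vel=0.5):
    """Velocity-aware Gaussian BEV heatmap target (hypothetical parameterization).

    Each expert waypoint contributes a Gaussian whose radius grows with speed;
    the per-cell maximum over waypoints forms the supervision target.
    """
    xs = np.linspace(-extent, extent, grid)
    X, Y = np.meshgrid(xs, xs)
    H = np.zeros((grid, grid))
    for (px, py), v in zip(waypoints, speeds):
        sigma = sigma_base + sigma_vel * v        # velocity-aware radius
        g = np.exp(-((X - px) ** 2 + (Y - py) ** 2) / (2 * sigma ** 2))
        H = np.maximum(H, g)                      # union of reachable blobs
    return H

waypoints = [(0.0, 2.0), (0.0, 6.0), (1.0, 11.0)]   # toy expert trajectory
speeds = [4.0, 5.0, 6.0]
H = gaussian_heatmap(waypoints, speeds)
print(H.shape, round(float(H.max()), 3))
```

Faster segments thus widen the supervised "drivable" region, which is the behavior the velocity-aware ablation in Section 4 rewards.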
Loss Objectives:
- BEV heatmap loss (Gaussian focal loss) penalizes misalignment between predicted and target heatmaps.
- Diffusion loss is the mean squared error between predicted and actual noise: $\mathcal{L}_{\text{diff}} = \mathbb{E}_{t,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(\tau_t, t, c)\rVert^2\right]$.
- Weighted combination: the total objective sums the diffusion loss and the heatmap loss with a scalar weight on the heatmap term, with weights chosen in the ablation-reported settings.
4. Inference, Sampling, and Hyperparameters
TB-DiT employs a deterministic DDIM sampling procedure with a default of 20 steps. For each scenario, a set of candidate trajectory rollouts is generated as follows:
- Sample an initial trajectory state from a standard Gaussian for each rollout.
- For each denoising step from $T$ down to $1$, compute the conditioning from TrajBEV and the ego-state query, then update the trajectory state as above using DDIM or standard diffusion sampling.
- Collect the fully denoised trajectories as the candidate planned trajectories.
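The rollout loop can be sketched as a standard $\eta = 0$ DDIM sampler. The schedule and the dummy noise predictor below are illustrative stand-ins (the real `eps_model` would be the full conditioned TB-DiT network):

```python
import numpy as np

def ddim_sample(eps_model, shape, T=20, rng=None):
    """Deterministic DDIM rollout with a toy linear alpha-bar schedule (sketch)."""
    rng = rng or np.random.default_rng(3)
    alpha_bar = np.linspace(0.999, 0.01, T)       # hypothetical noise schedule
    tau = rng.normal(size=shape)                  # tau_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps = eps_model(tau, t)
        tau0 = (tau - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)      # predicted clean traj
        tau = np.sqrt(ab_prev) * tau0 + np.sqrt(1 - ab_prev) * eps  # eta = 0 update
    return tau

# Dummy noise predictor standing in for the conditioned TB-DiT network.
dummy_eps = lambda tau, t: tau * 0.1
plans = [ddim_sample(dummy_eps, shape=(8, 3)) for _ in range(4)]    # 4 candidate rollouts
print(len(plans), plans[0].shape)
```

Because the update omits the stochastic term, repeated rollouts differ only through their initial noise samples, which is what makes initial-point resampling (Section 5) an effective diversity mechanism.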
Key hyperparameters:
- Diffusion steps: 20 by default (values in the range $12$–$40$ evaluated).
- Sampler: DDIM.
- Core block repetition count and channel dimensions determined empirically.
Gaussian Heatmap Configuration:
A velocity-aware heatmap radius yields higher performance than fixed-radius alternatives (velocity-aware: 87.5 PDMS vs. 87.2 and 85.7 for the two fixed radii evaluated).
5. Experimental Evaluation
NAVSIM evaluation with the Planning Distance Metric Score (PDMS) demonstrates TB-DiT’s efficacy:
| Method | Percep. Annotation | Anchor-free | PDMS |
|---|---|---|---|
| TrajDiff (C+L) | × | ✓ | 87.5 |
| TrajDiff* (w/ scaling) | × | ✓ | 88.5 |
| LAW | × | ✓ | 83.8 |
| DiffusionDrive | ✓ | × | 88.1 |
| WoTE | ✓ | × | 88.3 |
| Transfuser-DP | ✓ | ✓ | 85.7 |
| World4Drive | × | × | 85.1 |
Notably, TrajDiff/TB-DiT surpasses all prior annotation-free frameworks by at least 2 PDMS points, and with data scaling reaches parity with the best perception-annotated diffusion baselines (Gui et al., 30 Nov 2025).
Ablation Results:
- Full TB-DiT (with TrajBEV, Ego-BEV interaction, and cross-attention): 87.5 PDMS.
- Removing TrajBEV or self-supervised heatmap supervision significantly degrades performance (to 81.6 PDMS).
- Isolating ego-BEV interaction or cross-attention yields intermediate gains (up to 87.1 PDMS).
Data Scaling:
Combining initial-point resampling and increased data yields 88.5 PDMS, approaching fully supervised models.
6. Key Distinctions and Significance
TB-DiT introduces several defining characteristics compared to earlier diffusion-based planning and prediction frameworks:
- Perception annotation-free: Trajectory compliance is enforced using a proxy heatmap loss, obviating the need for semantic maps or dense pixel-level supervision.
- Anchor-free, diverse generation: No handcrafted motion anchors; plausible modes emerge directly via the self-supervised structure of the BEV heatmap and trajectory diffusion process.
- Ego-context fusion: The explicit cross-attention between ego query and BEV features improves conditioning fidelity, verified by quantitative ablation.
A plausible implication is that TB-DiT provides a scalable blueprint for deploying diffusion-based planners in domains and geographies lacking semantic or perception annotations, with robustness to data scaling.
7. Relationship to Related Approaches
TB-DiT can be contrasted with contemporaneous works such as TopoDiffuser (Xu et al., 1 Aug 2025), which incorporate explicit topometric maps in BEV-encoded conditioning for trajectory generation, leveraging an auxiliary road segmentation loss for geometric compliance. TB-DiT, in contrast, eliminates such supervision, instead transferring environmental regularities through a trajectory-aligned Gaussian heatmap. While TopoDiffuser achieves strong results on multimodal trajectory prediction (e.g., KITTI), TB-DiT demonstrates competitive or superior performance in direct planning settings, particularly when large-scale, annotation-free data is utilized.
In summary, the Trajectory-oriented BEV Diffusion Transformer (TB-DiT) constitutes a principled, self-supervised blueprint for BEV-conditioned generative trajectory planning, establishing state-of-the-art results in perception annotation-free, end-to-end autonomous driving (Gui et al., 30 Nov 2025).