TB-DiT: BEV Diffusion for Trajectory Planning
- The paper introduces TB-DiT, a diffusion-based transformer integrating BEV heatmaps and ego-state embeddings for annotation-free trajectory planning.
- It employs a dual-stage architecture with sensor fusion and specialized cross-attention, refining predictions using a Gaussian heatmap as a proxy supervisory signal.
- Experimental evaluation on NAVSIM shows that TB-DiT improves performance by at least 2 PDMS points over previous annotation-free methods, rivaling perception-supervised models.
The Trajectory-oriented BEV Diffusion Transformer (TB-DiT) is a generative planning module integrated into the annotation-free, end-to-end autonomous driving framework TrajDiff. TB-DiT synthesizes multimodal, plausible future ego-vehicle trajectories from raw sensor observations, fusing environment context from self-supervised Bird's-Eye-View (BEV) features and predictive embeddings of ego-state in a diffusion-based transformer architecture. Unlike conventional frameworks, TB-DiT eschews explicit perception supervision and handcrafted motion anchors, employing a Gaussian BEV heatmap as a proxy supervisory signal for trajectory compliance with road context and navigation intent. Evaluation on NAVSIM demonstrates that TB-DiT achieves state-of-the-art performance among annotation-free methods, rivaling perception-annotated planners when trained at scale (Gui et al., 30 Nov 2025).
1. Architectural Composition and Input Encoding
TB-DiT operates atop a dual-stage encoder paradigm: the trajectory-oriented BEV encoder and the diffusion transformer core.
Trajectory-oriented BEV Encoder:
- Sensor fusion aggregates camera and LiDAR features via a Transfuser-style backbone, yielding a fused BEV feature map.
- Ego-state signals (velocity, acceleration, high-level command) are embedded through an MLP into an ego-state token.
- A learned query bank undergoes transformer-based contextualization against the BEV feature and the ego-state embedding.
- A lightweight decoder maps the contextualized queries to a BEV heatmap prediction.
- The BEV feature and predicted heatmap are fused to form the TrajBEV feature used for subsequent conditioning.
Summary Table: TB-DiT Input/Feature Flow
| Stage | Input(s) | Output(s) |
|---|---|---|
| Sensor Fusion Backbone | Camera, LiDAR | Fused BEV feature map |
| Ego-state Encoder | Velocity, Acceleration, Command | Ego-state embedding |
| Transformer Query Interaction | Query bank, BEV feature, ego embedding | Contextualized queries |
| Heatmap Decoder | Contextualized queries | BEV heatmap prediction |
| BEV/Heatmap Fusion | BEV feature, predicted heatmap | TrajBEV feature |
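The feature flow above can be sketched in a minimal, dependency-light form. The sketch below uses numpy with single-head attention and hypothetical shapes and fusion (element-wise modulation of the BEV feature by the heatmap); the paper's actual dimensions, projection layers, and fusion operator are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # Single-head scaled dot-product attention (learned projections omitted).
    attn = softmax(q @ kv.T / np.sqrt(d))
    return attn @ kv

# Hypothetical shapes; not taken from the paper.
H, W, d = 16, 16, 32                      # BEV grid and channel width
bev = rng.normal(size=(H * W, d))         # fused camera+LiDAR BEV feature
ego_raw = rng.normal(size=7)              # velocity, acceleration, command
W_ego = rng.normal(size=(7, d)) * 0.1
ego = np.tanh(ego_raw @ W_ego)            # one-layer stand-in for the ego MLP

queries = rng.normal(size=(64, d))        # learned query bank
ctx = np.concatenate([bev, ego[None]], axis=0)
queries = queries + cross_attention(queries, ctx, d)  # transformer contextualization

W_dec = rng.normal(size=(d, H * W // 64)) * 0.1
heatmap = (queries @ W_dec).reshape(H, W)             # lightweight decoder -> heatmap
trajbev = bev.reshape(H, W, d) * heatmap[..., None]   # one plausible BEV/heatmap fusion

print(trajbev.shape)   # (16, 16, 32)
```

The essential point is the data flow: queries attend jointly to BEV and ego tokens, decode to a heatmap, and the heatmap re-weights the BEV feature into TrajBEV.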
2. Diffusion Transformer Core
At each step of the denoising process, TB-DiT operates over noisy trajectory states (positions and heading), conditioned on the TrajBEV context, an ego-state query, and a timestep embedding. Its structure incorporates specialized attention mechanisms to capture temporal, spatial, and ego-centric dependencies:
- Ego-BEV Interaction: a cross-attention module updates the ego-state query with the TrajBEV feature, yielding the composite conditioning signal.
- Trajectory Encoder: the noisy trajectory is projected into a latent token sequence via a small encoder.
- Core TB-DiT Block (repeated a fixed number of times):
  - Temporal self-attention over the trajectory latents captures dependencies across future timesteps.
  - BEV cross-attention: the TrajBEV feature is compressed by a Q-former into a compact token set, which the trajectory latents attend to, fusing spatial context into the trajectory representation.
  - MLP decoding: a feed-forward head maps the updated latents to the predicted noise.
- Reverse Diffusion: using the predicted noise $\epsilon_\theta$, the next state follows the standard denoising update:
$$\tau_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\tau_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\right) + \sigma_t z,\qquad z \sim \mathcal{N}(0, I).$$
- Deterministic DDIM sampling can optionally omit the noise term for efficient rollout generation.
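One denoising iteration of the core block can be illustrated compactly. The numpy sketch below (single-head attention, no layer norms or learned projections, hypothetical sizes) shows temporal self-attention, Q-former compression of TrajBEV followed by cross-attention, an MLP noise head, and a deterministic reverse update with the stochastic term dropped; none of the dimensions are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

T_horizon, d = 8, 32                      # 8 future waypoints (hypothetical)
z = rng.normal(size=(T_horizon, d))       # encoded noisy trajectory latents
trajbev = rng.normal(size=(256, d))       # flattened TrajBEV tokens
qf_queries = rng.normal(size=(16, d))     # Q-former query bank (hypothetical size)

# Core block, repeated N times in the full model (one iteration shown).
z = z + attention(z, z, z)                             # temporal self-attention
bev_tokens = attention(qf_queries, trajbev, trajbev)   # Q-former compression
z = z + attention(z, bev_tokens, bev_tokens)           # BEV cross-attention

W_out = rng.normal(size=(d, 3)) * 0.1
eps_pred = z @ W_out                      # MLP head -> noise over (x, y, heading)

# One reverse-diffusion mean update (noise term omitted, toy schedule values).
alpha_t, alpha_bar_t = 0.98, 0.5
tau_t = rng.normal(size=(T_horizon, 3))
tau_prev = (tau_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
print(tau_prev.shape)   # (8, 3)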
3. Mathematical Formulation and Supervisory Signals
TB-DiT employs a discrete-time diffusion model directly over continuous future trajectories. The forward process iteratively adds Gaussian noise to the expert trajectory $\tau_0$:
$$q(\tau_t \mid \tau_{t-1}) = \mathcal{N}\!\left(\tau_t;\ \sqrt{1-\beta_t}\,\tau_{t-1},\ \beta_t I\right),\qquad \tau_t = \sqrt{\bar{\alpha}_t}\,\tau_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).$$
The denoising (reverse) process estimates the score function via the transformer, using the noise-prediction parameterization for each conditioned trajectory state.
Gaussian BEV Heatmap Target:
The heatmap supervision eschews annotated perception maps; instead, it encodes the stochastic reachable set of the agent as a per-cell maximum of velocity-scaled Gaussians centered on the expert waypoints:
$$H(\mathbf{x}) = \max_{t}\ \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{p}_t \rVert^2}{2\,\sigma(\mathbf{v}_t)^2}\right),$$
where $\mathbf{p}_t$ and $\mathbf{v}_t$ are the position and velocity at step $t$, and the radius $\sigma(\cdot)$ grows with speed. This aligns supervision to plausible drivable space induced solely by expert demonstrations.
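Constructing this target is straightforward. The sketch below assumes a hypothetical linear speed-to-radius mapping and toy grid parameters (`sigma_base`, `sigma_vel`, `extent` are illustrative names, not from the paper):

```python
import numpy as np

def gaussian_heatmap(waypoints, speeds, grid=64, extent=32.0,
                     sigma_base=1.0, sigma_vel=0.5):
    """Velocity-aware Gaussian BEV heatmap target (hypothetical parameterization).

    Each expert waypoint contributes a Gaussian whose radius grows with speed;
    the per-cell maximum over waypoints forms the supervision target.
    """
    xs = np.linspace(-extent, extent, grid)
    X, Y = np.meshgrid(xs, xs)
    H = np.zeros((grid, grid))
    for (px, py), v in zip(waypoints, speeds):
        sigma = sigma_base + sigma_vel * v        # velocity-aware radius
        g = np.exp(-((X - px) ** 2 + (Y - py) ** 2) / (2 * sigma ** 2))
        H = np.maximum(H, g)                      # union of reachable blobs
    return H

waypoints = [(0.0, 2.0), (0.0, 6.0), (1.0, 11.0)]   # toy expert trajectory
speeds = [4.0, 5.0, 6.0]
H = gaussian_heatmap(waypoints, speeds)
print(H.shape, round(float(H.max()), 3))
```

Faster segments thus widen the supervised "drivable" region, which is the behavior the velocity-aware ablation in Section 4 rewards.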
Loss Objectives:
- BEV heatmap loss (Gaussian focal loss) penalizes misalignment between predicted and target heatmaps.
- Diffusion loss is the mean squared error between predicted and actual noise: $\mathcal{L}_{\text{diff}} = \mathbb{E}_{t,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(\tau_t, t, c)\rVert^2\right]$.
- Weighted combination: the total objective sums the diffusion loss and the heatmap loss with a scalar weight on the heatmap term, with weights chosen in the ablation-reported settings.
4. Inference, Sampling, and Hyperparameters
TB-DiT employs a deterministic DDIM sampling procedure with a default of 20 steps. For each scenario, a set of candidate trajectory rollouts is generated as follows:
- Sample an initial trajectory state from a standard Gaussian for each rollout.
- For each denoising step from $T$ down to $1$, compute the conditioning from TrajBEV and the ego-state query, then update the trajectory state as above using DDIM or standard diffusion sampling.
- Collect the fully denoised trajectories as the candidate planned trajectories.
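The rollout loop can be sketched as a standard $\eta = 0$ DDIM sampler. The schedule and the dummy noise predictor below are illustrative stand-ins (the real `eps_model` would be the full conditioned TB-DiT network):

```python
import numpy as np

def ddim_sample(eps_model, shape, T=20, rng=None):
    """Deterministic DDIM rollout with a toy linear alpha-bar schedule (sketch)."""
    rng = rng or np.random.default_rng(3)
    alpha_bar = np.linspace(0.999, 0.01, T)       # hypothetical noise schedule
    tau = rng.normal(size=shape)                  # tau_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps = eps_model(tau, t)
        tau0 = (tau - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)      # predicted clean traj
        tau = np.sqrt(ab_prev) * tau0 + np.sqrt(1 - ab_prev) * eps  # eta = 0 update
    return tau

# Dummy noise predictor standing in for the conditioned TB-DiT network.
dummy_eps = lambda tau, t: tau * 0.1
plans = [ddim_sample(dummy_eps, shape=(8, 3)) for _ in range(4)]    # 4 candidate rollouts
print(len(plans), plans[0].shape)
```

Because the update omits the stochastic term, repeated rollouts differ only through their initial noise samples, which is what makes initial-point resampling (Section 5) an effective diversity mechanism.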
Key hyperparameters:
- Diffusion steps: 20 by default (values in the range $12$–$40$ evaluated).
- Sampler: DDIM.
- Core block repetition count and channel dimensions determined empirically.
Gaussian Heatmap Configuration:
A velocity-aware heatmap radius yields higher performance than fixed-radius alternatives (velocity-aware: 87.5 PDMS vs. 87.2 and 85.7 for the two fixed radii evaluated).
5. Experimental Evaluation
NAVSIM evaluation with the Planning Distance Metric Score (PDMS) demonstrates TB-DiT’s efficacy:
| Method | Percep. Annotation | Anchor-free | PDMS |
|---|---|---|---|
| TrajDiff (C+L) | × | ✓ | 87.5 |
| TrajDiff* (w/ scaling) | × | ✓ | 88.5 |
| LAW | × | ✓ | 83.8 |
| DiffusionDrive | ✓ | × | 88.1 |
| WoTE | ✓ | × | 88.3 |
| Transfuser-DP | ✓ | ✓ | 85.7 |
| World4Drive | × | × | 85.1 |
Notably, TrajDiff/TB-DiT surpasses all prior annotation-free frameworks by at least 2 PDMS points, and with data scaling reaches parity with the best perception-annotated diffusion baselines (Gui et al., 30 Nov 2025).
Ablation Results:
- Full TB-DiT (with TrajBEV, Ego-BEV interaction, and cross-attention): 87.5 PDMS.
- Removing TrajBEV or self-supervised heatmap supervision significantly degrades performance (to 81.6 PDMS).
- Isolating ego-BEV interaction or cross-attention yields intermediate gains (up to 87.1 PDMS).
Data Scaling:
Combining initial-point resampling and increased data yields 88.5 PDMS, approaching fully supervised models.
6. Key Distinctions and Significance
TB-DiT introduces several defining characteristics compared to earlier diffusion-based planning and prediction frameworks:
- Perception annotation-free: Trajectory compliance is enforced using a proxy heatmap loss, obviating the need for semantic maps or dense pixel-level supervision.
- Anchor-free, diverse generation: No handcrafted motion anchors; plausible modes emerge directly via the self-supervised structure of the BEV heatmap and trajectory diffusion process.
- Ego-context fusion: The explicit cross-attention between ego query and BEV features improves conditioning fidelity, verified by quantitative ablation.
A plausible implication is that TB-DiT provides a scalable blueprint for deploying diffusion-based planners in domains and geographies lacking semantic or perception annotations, with robustness to data scaling.
7. Relationship to Related Approaches
TB-DiT can be contrasted with contemporaneous works such as TopoDiffuser (Xu et al., 1 Aug 2025), which incorporate explicit topometric maps in BEV-encoded conditioning for trajectory generation, leveraging an auxiliary road segmentation loss for geometric compliance. TB-DiT, in contrast, eliminates such supervision, instead transferring environmental regularities through a trajectory-aligned Gaussian heatmap. While TopoDiffuser achieves strong results on multimodal trajectory prediction (e.g., KITTI), TB-DiT demonstrates competitive or superior performance in direct planning settings, particularly when large-scale, annotation-free data is utilized.
In summary, the Trajectory-oriented BEV Diffusion Transformer (TB-DiT) constitutes a principled, self-supervised blueprint for BEV-conditioned generative trajectory planning, establishing state-of-the-art results in perception annotation-free, end-to-end autonomous driving (Gui et al., 30 Nov 2025).