
TB-DiT: BEV Diffusion for Trajectory Planning

Updated 7 December 2025
  • The paper introduces TB-DiT, a diffusion-based transformer integrating BEV heatmaps and ego-state embeddings for annotation-free trajectory planning.
  • It employs a dual-stage architecture with sensor fusion and specialized cross-attention, refining predictions using a Gaussian heatmap as a proxy supervisory signal.
  • Experimental evaluation on NAVSIM shows TB-DiT to improve performance by at least 2 PDMS points over previous annotation-free methods, rivaling perception-based models.

The Trajectory-oriented BEV Diffusion Transformer (TB-DiT) is a generative planning module integrated into the annotation-free, end-to-end autonomous driving framework TrajDiff. TB-DiT synthesizes multimodal, plausible future ego-vehicle trajectories from raw sensor observations, fusing environment context from self-supervised Bird’s-Eye-View (BEV) features with predictive embeddings of the ego-state in a diffusion-based transformer architecture. Unlike conventional frameworks, TB-DiT eschews explicit perception supervision and handcrafted motion anchors, employing a Gaussian BEV heatmap as a proxy supervisory signal that keeps trajectories compliant with road context and navigation intent. Evaluation on NAVSIM demonstrates that TB-DiT achieves state-of-the-art performance among annotation-free methods, rivaling perception-annotated planners in large-scale settings (Gui et al., 30 Nov 2025).

1. Architectural Composition and Input Encoding

TB-DiT operates atop a dual-stage encoder paradigm: the trajectory-oriented BEV encoder and the diffusion transformer core.

Trajectory-oriented BEV Encoder:

  • Sensor fusion aggregates camera and LiDAR features via a Transfuser-style backbone, yielding $F_{\text{bev}} \in \mathbb{R}^{H \times W \times C}$.
  • Ego-state signals (velocity, acceleration, high-level command) are embedded as $F_{\text{ego}} \in \mathbb{R}^{1 \times C}$ through an MLP.
  • A learned query bank $Q_{\text{heatmap}} \in \mathbb{R}^{N \times C}$ undergoes transformer-based contextualization. The transformer $\mathcal{F}_H$ is applied as:

$$\hat{Q}_{\text{heatmap}} = \mathcal{F}_H\bigl(Q_{\text{heatmap}}, \operatorname{concat}(F_{\text{bev}}, F_{\text{ego}})\bigr)$$

  • A lightweight decoder $\mathcal{D}_H$ maps the contextualized queries to a heatmap prediction $F_{\text{heatmap}} \in \mathbb{R}^{H \times W \times 1}$.
  • The BEV feature and heatmap are fused through $\mathcal{G}_{\text{fuse}}$, forming the TrajBEV feature $F_{\text{traj}}$ for subsequent conditioning.
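The query-contextualization step can be sketched with plain single-head scaled dot-product attention. This is a minimal illustration of the shape flow, not the paper's implementation: the dimensions, random features, and single-head attention are placeholder assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # Single-head scaled dot-product attention: queries attend to context.
    scores = queries @ context.T / np.sqrt(d)        # (N, H*W + 1)
    return softmax(scores) @ context                 # (N, C)

H, W, C, N = 8, 8, 16, 4                             # illustrative sizes
F_bev = np.random.randn(H * W, C)                    # flattened BEV features
F_ego = np.random.randn(1, C)                        # ego-state embedding
Q_heatmap = np.random.randn(N, C)                    # learned query bank

context = np.concatenate([F_bev, F_ego], axis=0)     # concat(F_bev, F_ego)
Q_hat = cross_attention(Q_heatmap, context, C)       # contextualized queries
```

In the actual encoder this interaction is a full transformer $\mathcal{F}_H$ rather than one attention layer; the sketch only shows that $N$ queries attend jointly over BEV and ego tokens and come back with shape $(N, C)$.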

Summary Table: TB-DiT Input/Feature Flow

| Stage | Input(s) | Output(s) |
|---|---|---|
| Sensor fusion backbone | Camera, LiDAR | $F_{\text{bev}}$ |
| Ego-state encoder | Velocity, acceleration, command | $F_{\text{ego}}$ |
| Transformer query interaction | $Q_{\text{heatmap}}, F_{\text{bev}}, F_{\text{ego}}$ | $\hat{Q}_{\text{heatmap}}$ |
| Heatmap decoder | $\hat{Q}_{\text{heatmap}}$ | $F_{\text{heatmap}}$ |
| BEV/heatmap fusion | $F_{\text{bev}}, F_{\text{heatmap}}$ | $F_{\text{traj}}$ |

2. Diffusion Transformer Core

At each step of the denoising process, TB-DiT operates over noisy trajectory states $X_t \in \mathbb{R}^{T_f \times 3}$ (positions and heading), with trajectory context $F_{\text{traj}}$, an ego-state query $Q_{\text{ego}}$, and timestep embedding $F_t$. Its structure incorporates specialized attention mechanisms to capture temporal, spatial, and ego-centric dependencies:

  • Ego-BEV Interaction: Cross-attention module $\mathcal{G}_{EB}$ updates $Q_{\text{ego}}$ with $F_{\text{traj}}$, yielding the composite condition $C = F_t + \hat{Q}_{\text{ego}}$.
  • Trajectory Encoder: $X_t$ is projected into a latent $Z_t$ via a small encoder $\mathcal{E}_{\text{traj}}$.
  • Core TB-DiT Block (repeated $L$ times):
  1. Temporal self-attention: $Z_t \rightarrow \tilde{Z}_t$ via $SA_t$.
  2. BEV cross-attention: compress $F_{\text{traj}}$ with a Q-former to produce $Q_{\text{BEV}}$; apply $CA_{\text{BEV}}$ to fuse $Z_t$ and $Q_{\text{BEV}}$.
  3. MLP decoding: $\mathcal{D}_{\text{traj}}$ maps $Z_t$ to the predicted noise $\hat{X}_t$.
  • Reverse Diffusion: Using the predicted noise $\epsilon_\theta(X_t, t, C)$, the next state is sampled as:

$$X_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Bigl(X_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(X_t, t, C)\Bigr) + \sigma_t \delta, \qquad \delta \sim \mathcal{N}(0, I)$$

  • Deterministic DDIM sampling can optionally omit the noise term for efficient rollout generation.
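The reverse update above is the standard DDPM sampling step and can be written directly. This is a sketch under stated assumptions: the noise prediction is a random stand-in for $\epsilon_\theta(X_t, t, C)$, and the scalar schedule values are illustrative.

```python
import numpy as np

def ddpm_reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """One reverse-diffusion update matching the sampling equation:
    X_{t-1} = (X_t - (1 - a_t) / sqrt(1 - abar_t) * eps) / sqrt(a_t) + sigma_t * delta."""
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
    noise = rng.standard_normal(x_t.shape) if sigma_t > 0 else 0.0
    return mean + sigma_t * noise

rng = np.random.default_rng(0)
T_f = 8                                    # future horizon; illustrative
x_t = rng.standard_normal((T_f, 3))        # noisy (x, y, heading) waypoints
eps = rng.standard_normal((T_f, 3))        # stand-in for eps_theta(X_t, t, C)
x_prev = ddpm_reverse_step(x_t, eps, alpha_t=0.98, alpha_bar_t=0.5,
                           sigma_t=0.0, rng=rng)
```

Setting `sigma_t=0` drops the stochastic term, which is exactly the deterministic variant mentioned above.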

3. Mathematical Formulation and Supervisory Signals

TB-DiT employs a discrete-time diffusion model directly over continuous future trajectories. The forward process iteratively adds Gaussian noise:

$$q(X_t \mid X_0) = \mathcal{N}\bigl(\sqrt{\bar{\alpha}_t}\, X_0,\ (1 - \bar{\alpha}_t)\, I\bigr)$$

The denoising (reverse) process estimates the score function via the transformer, using noise prediction for each conditioned trajectory state.
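The forward corruption can be sampled in closed form for any timestep, which is what makes noise-prediction training cheap. A minimal sketch (the trajectory and schedule value are illustrative):

```python
import numpy as np

def q_sample(x0, alpha_bar_t, rng):
    """Sample X_t ~ q(X_t | X_0) = N(sqrt(abar_t) X_0, (1 - abar_t) I).
    Returns the noised state and the noise used, which serves as the
    regression target for the denoiser."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
    return x_t, eps

rng = np.random.default_rng(1)
x0 = rng.standard_normal((8, 3))           # clean future trajectory (illustrative)
x_t, eps = q_sample(x0, alpha_bar_t=0.3, rng=rng)
```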

Gaussian BEV Heatmap Target:

The heatmap supervision eschews annotated perception maps; instead, it encodes the stochastic reachable set of the agent as:

$$GT_{xy} = \max_{i=1 \ldots T_f} \exp\Bigl(-\frac{(x - x_i)^2 + (y - y_i)^2}{2\,(\Gamma v_i)^2}\Bigr)$$

where $(x_i, y_i)$ and $v_i$ are the position and velocity at step $i$. This aligns supervision with the plausible drivable space induced solely by expert demonstrations.
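The velocity-aware target is a pixelwise max over per-waypoint Gaussians whose spread scales with speed. A sketch (grid extents, waypoints, and $\Gamma$ are illustrative; the guard against zero speed is an added assumption):

```python
import numpy as np

def heatmap_target(traj_xy, speeds, grid_x, grid_y, gamma=0.5):
    """Velocity-aware Gaussian BEV target:
    GT_xy = max_i exp(-((x - x_i)^2 + (y - y_i)^2) / (2 * (gamma * v_i)^2))."""
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="xy")
    maps = []
    for (xi, yi), vi in zip(traj_xy, speeds):
        sigma2 = (gamma * max(vi, 1e-3)) ** 2    # guard near-zero speed
        maps.append(np.exp(-((gx - xi) ** 2 + (gy - yi) ** 2) / (2 * sigma2)))
    return np.max(maps, axis=0)                  # pixelwise max over steps

traj = [(0.0, 0.0), (0.0, 2.0), (0.0, 4.5)]      # illustrative waypoints (m)
v = [2.0, 2.5, 3.0]                              # speed at each step (m/s)
gt = heatmap_target(traj, v,
                    np.linspace(-8, 8, 64), np.linspace(-2, 14, 64))
```

Faster waypoints produce wider Gaussians, so the target widens exactly where the reachable set is larger.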

Loss Objectives:

  • BEV heatmap loss (Gaussian focal loss) $L_{\text{bev}}$ penalizes misalignment between predicted and target heatmaps.
  • Diffusion loss $L_{\text{diff}}$ is the mean squared error between predicted and actual noise:

$$L_{\text{diff}} = \mathbb{E}_{t, X_0, \epsilon_t}\bigl[\lVert \epsilon_\theta(X_t, t, C) - \epsilon_t \rVert^2\bigr]$$

  • Weighted combination: $L = \omega_1 L_{\text{bev}} + \omega_2 L_{\text{diff}}$, with $\omega_1 = 200$, $\omega_2 = 10$ in ablation-reported settings.
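The weighted objective can be sketched as follows. The CornerNet-style focal-loss variant shown here is a common choice for Gaussian heatmap targets and is an assumption; the paper's exact formulation may differ.

```python
import numpy as np

def gaussian_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """CornerNet-style focal loss on a Gaussian heatmap target
    (assumed variant): peaks (gt == 1) are positives, the Gaussian
    tail down-weights negatives near the trajectory."""
    pred = np.clip(pred, eps, 1 - eps)
    pos = gt >= 1.0 - 1e-6
    pos_loss = -(((1 - pred) ** alpha) * np.log(pred))[pos].sum()
    neg_loss = -(((1 - gt) ** beta) * (pred ** alpha) * np.log(1 - pred))[~pos].sum()
    return (pos_loss + neg_loss) / max(pos.sum(), 1)

def total_loss(pred_hm, gt_hm, eps_pred, eps_true, w1=200.0, w2=10.0):
    # L = w1 * L_bev + w2 * L_diff, with the weights reported above.
    l_bev = gaussian_focal_loss(pred_hm, gt_hm)
    l_diff = np.mean((eps_pred - eps_true) ** 2)   # diffusion MSE
    return w1 * l_bev + w2 * l_diff
```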

4. Inference, Sampling, and Hyperparameters

TB-DiT employs a deterministic DDIM sampling procedure with a default of 20 steps. For each scenario and $K$-trajectory rollout, the process is:

  1. Sample XTN(0,I)X_T \sim \mathcal{N}(0, I) for each kk.
  2. For t=T1t = T \ldots 1, compute conditioning C=Ft+GEB(Qego,Ftraj)C = F_t + \mathcal{G}_{EB}(Q_{\text{ego}}, F_{\text{traj}}), update Xt1X_{t-1} as above using DDIM or standard diffusion.
  3. Collect {X0(k)}k=1K\{X_0^{(k)}\}_{k=1}^K as the candidate planned trajectories.
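The loop above can be sketched as a deterministic DDIM rollout. The $\bar{\alpha}$ schedule and the zero-noise stand-in for $\epsilon_\theta(X_t, t, C)$ are illustrative assumptions; a real run would call the trained denoiser with the conditioning $C$ at each step.

```python
import numpy as np

def ddim_sample(eps_model, alpha_bars, shape, rng):
    """Deterministic (eta = 0) DDIM rollout.
    alpha_bars[t] is the cumulative product at step t, decreasing in t;
    eps_model(x_t, t) stands in for eps_theta(X_t, t, C)."""
    x = rng.standard_normal(shape)                    # X_T ~ N(0, I)
    for t in range(len(alpha_bars) - 1, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps
    return x

rng = np.random.default_rng(0)
alpha_bars = np.linspace(0.999, 0.05, 20)             # toy 20-step schedule
traj = ddim_sample(lambda x, t: np.zeros_like(x), alpha_bars, (8, 3), rng)
```

Sampling $K$ rollouts amounts to repeating this with independent draws of $X_T$.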

Key hyperparameters:

  • Diffusion steps $T = 20$ (values from 12 to 40 evaluated).
  • Sampler: DDIM.
  • Core block repetition and channel dimensions as determined empirically.

Gaussian Heatmap Configuration:

A velocity-aware $\Gamma$ yields higher performance than fixed-radius alternatives (velocity-aware: 87.5 PDMS vs. fixed $r = 5\,\text{m}$: 87.2; $r = 25\,\text{m}$: 85.7).

5. Experimental Evaluation

NAVSIM evaluation with the Planning Distance Metric Score (PDMS) demonstrates TB-DiT’s efficacy:

| Method | Percep. annotation | Anchor-free | PDMS |
|---|---|---|---|
| TrajDiff (C+L) | × | ✓ | 87.5 |
| TrajDiff* (w/ scaling) | × | ✓ | 88.5 |
| LAW | × | ✓ | 83.8 |
| DiffusionDrive | ✓ | × | 88.1 |
| WoTE | ✓ | × | 88.3 |
| Transfuser-DP | ✓ | ✓ | 85.7 |
| World4Drive | × | × | 85.1 |

Notably, TrajDiff/TB-DiT surpasses all prior annotation-free frameworks by at least 2 PDMS points, and with data scaling reaches parity with the best perception-annotated diffusion baselines (Gui et al., 30 Nov 2025).

Ablation Results:

  • Full TB-DiT (with TrajBEV, Ego-BEV interaction, and cross-attention): 87.5 PDMS.
  • Removing TrajBEV or self-supervised heatmap supervision significantly degrades performance (to 81.6 PDMS).
  • Isolating ego-BEV interaction or cross-attention yields intermediate gains (up to 87.1 PDMS).

Data Scaling:

Combining initial-point resampling and increased data yields 88.5 PDMS, approaching fully supervised models.

6. Key Distinctions and Significance

TB-DiT introduces several defining characteristics compared to earlier diffusion-based planning and prediction frameworks:

  • Perception annotation-free: Trajectory compliance is enforced using a proxy heatmap loss, obviating the need for semantic maps or dense pixel-level supervision.
  • Anchor-free, diverse generation: No handcrafted motion anchors; plausible modes emerge directly via the self-supervised structure of the BEV heatmap and trajectory diffusion process.
  • Ego-context fusion: The explicit cross-attention between ego query and BEV features improves conditioning fidelity, verified by quantitative ablation.

A plausible implication is that TB-DiT provides a scalable blueprint for deploying diffusion-based planners in domains and geographies lacking semantic or perception annotations, with robustness to data scaling.

TB-DiT can be contrasted with contemporaneous works such as TopoDiffuser (Xu et al., 1 Aug 2025), which incorporate explicit topometric maps in BEV-encoded conditioning for trajectory generation, leveraging an auxiliary road segmentation loss for geometric compliance. TB-DiT, in contrast, eliminates such supervision, instead transferring environmental regularities through a trajectory-aligned Gaussian heatmap. While TopoDiffuser achieves strong results on multimodal trajectory prediction (e.g., KITTI), TB-DiT demonstrates competitive or superior performance in direct planning settings, particularly when large-scale, annotation-free data is utilized.

In summary, the Trajectory-oriented BEV Diffusion Transformer (TB-DiT) constitutes a principled, self-supervised blueprint for BEV-conditioned generative trajectory planning, establishing state-of-the-art results in perception annotation-free, end-to-end autonomous driving (Gui et al., 30 Nov 2025).
