
TransFuser v6: End-to-End Driving Policy

Updated 14 February 2026
  • TransFuser v6 is an end-to-end driving policy that fuses multi-modal sensor inputs into a unified BEV representation using transformer self-attention.
  • It integrates cameras, LiDAR, and radar with a novel sensor fusion pipeline and enhanced navigational intent modeling for improved path planning.
  • Leveraging the LEAD expert policy for student-aligned imitation learning, TFv6 achieves state-of-the-art results on CARLA benchmarks and adapts to real-world driving.

TransFuser v6 (TFv6) is an advanced end-to-end driving policy that fuses multi-modal sensory input using a transformer-based bird's-eye-view (BEV) architecture and is trained with a rigorously student-aligned expert, the LEAD policy. TFv6 achieves state-of-the-art closed-loop performance on multiple CARLA benchmarks and can be adapted for real-world driving tasks using latent representations and perception supervision. The LEAD system, with its focus on minimizing learner–expert asymmetries in visibility, uncertainty, and intent modeling, represents a key methodology shift in large-scale imitation learning for autonomous driving (Nguyen et al., 23 Dec 2025).

1. Architectural Innovations and Sensor Fusion

TFv6 retains and extends the core design principles of the TransFuser architecture, embedding a transformer-based fusion module to combine inputs from multiple sensors into a dense BEV token space for planning and control.

Sensor Fusion Pipeline

  • Cameras: Six front-and-side RGB cameras provide image frames, each processed through a convolutional backbone (ResNet-34 or RegNetY-032). View-frustum projection lifts each camera's feature map into BEV space, resulting in N = H × W BEV tokens of dimensionality D (e.g., D = 256).
  • LiDAR: Point clouds are voxelized and fed through a sparse 3D encoder, which also projects to the shared BEV token grid.
  • Radar (TFv6 addition): Four automotive radars yield up to 75 detections per frame. Detections (range, azimuth, radial velocity) are embedded with a lightweight learned encoder, then injected as object-level tokens directly into the planner’s cross-attention keys/values.
  • BEV Self-Attention: The joint tokens (camera, LiDAR, radar) are concatenated and passed through L layers of transformer self-attention, generating a unified BEV representation T = { t_1, …, t_N }.
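The fusion step above can be sketched as single-head scaled dot-product self-attention over the concatenated token set. This is an illustrative toy, not the paper's implementation: token counts, weight initialization, and the single-head/single-layer simplification are all assumptions.

```python
import numpy as np

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over fused BEV tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
D = 256                                  # token dimensionality from the text
cam = rng.normal(size=(64, D))           # hypothetical 8x8 camera BEV grid
lidar = rng.normal(size=(64, D))         # LiDAR tokens on the shared grid
radar = rng.normal(size=(75, D))         # up to 75 radar detections per frame

tokens = np.concatenate([cam, lidar, radar], axis=0)   # joint token set
Wq, Wk, Wv = (rng.normal(size=(D, D)) * D**-0.5 for _ in range(3))
fused = self_attention(tokens, Wq, Wk, Wv)             # one of L fusion layers
print(fused.shape)   # (203, 256)
```

In practice each of the L layers would add residual connections, layer norm, and a feed-forward block; the sketch keeps only the attention mixing that unifies the three modalities.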

Decoder and Control Heads

  • Route Queries: M learned query vectors Q ∈ ℝ^{M×D} attend across BEV tokens via cross-attention in K transformer decoder layers.
  • Outputs: Each query produces a predicted path spline (or steering offset) through an MLP, and a separate “speed query” outputs the speed command.
  • Training Objective: The policy is trained end-to-end by minimizing a mean-squared imitation loss between student predictions and LEAD expert actions:

L_im = E_{o,s}[ ‖ π_θ(o) − a_expert(s) ‖² ]

  • Key Change from TFv5: Removal of the late GRU bottleneck (previously, route points passed only via a small GRU with limited information); TFv6 injects navigation tokens directly into the transformer stack (see Section 3).
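The decoder and training objective above can be condensed into a short sketch: learned queries cross-attend over BEV tokens, an MLP head maps each attended query to a path point, and the loss is the MSE between student and expert outputs. All sizes, the placeholder expert actions, and the single-layer/single-head simplification are assumptions for illustration.

```python
import numpy as np

def cross_attention(Q, mem, Wq, Wk, Wv):
    """Queries attend over BEV memory tokens (one decoder layer, one head)."""
    q, k, v = Q @ Wq, mem @ Wk, mem @ Wv
    s = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
M, N, D = 8, 203, 256              # M route queries, N BEV tokens (toy sizes)
queries = rng.normal(size=(M, D))
bev = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * D**-0.5 for _ in range(3))
W_head = rng.normal(size=(D, 2)) * D**-0.5   # head mapping query -> (x, y)

attended = cross_attention(queries, bev, Wq, Wk, Wv)
pred_points = attended @ W_head              # student path prediction, (M, 2)
expert_points = rng.normal(size=(M, 2))      # placeholder LEAD expert actions
L_im = np.mean((pred_points - expert_points) ** 2)   # MSE imitation loss
print(pred_points.shape, L_im >= 0)
```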

2. Learner–Expert Asymmetry Interventions

TFv6 is trained with a new “student-centric” expert policy (LEAD) explicitly designed to minimize asymmetries between the privileged expert and the sensor-bound student during imitation learning.

2.1 Visibility Asymmetry

  • In prior PDM-Lite experts, forecasts considered all actors’ ground-truth boxes, regardless of camera occlusion or field of view.
  • LEAD restriction: Only actors with bounding boxes intersecting the student's camera frustum F are used:

A_vis = { a : box_2D(a) ∩ F ≠ ∅ }

  • Traffic lights and speed signs are masked to only those visible in the current frustum under current weather/time conditions.
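The visibility mask A_vis amounts to a simple intersection filter. The sketch below uses axis-aligned 2D boxes as a stand-in for the projected frustum test; the box representation and the example coordinates are assumptions, not the paper's geometry code.

```python
def boxes_intersect(a, b):
    """Axis-aligned 2D intersection test; boxes are (xmin, ymin, xmax, ymax)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def visible_actors(actors, frustum):
    """A_vis: keep only actors whose projected box overlaps the frustum F."""
    return [a for a in actors if boxes_intersect(a["box2d"], frustum)]

frustum = (0.0, 0.0, 100.0, 60.0)            # hypothetical camera frustum F
actors = [
    {"id": 1, "box2d": (10, 10, 14, 12)},    # inside the frustum -> visible
    {"id": 2, "box2d": (120, 5, 125, 9)},    # fully outside -> masked out
    {"id": 3, "box2d": (98, 58, 104, 63)},   # partial overlap -> visible
]
print([a["id"] for a in visible_actors(actors, frustum)])   # [1, 3]
```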

2.2 Uncertainty Asymmetry

  • Collision prediction in PDM-Lite used perfect kinematic data; LEAD halts if hazards (based only on student-visible context) invade a safety margin d_safe, independent of kinematic certainty.
  • Speed targets are reduced under poor visibility via v_target ← v_target · (1 − α), with α ∈ [0, 0.3].
  • 3D actor boxes for oncoming vehicles at unprotected turns are inflated by Δm before collision checking.
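Taken together, the three interventions can be sketched as one rule: scale the speed target, inflate hazard boxes, and halt when any inflated box enters the safety margin. This is a simplified 2D stand-in for LEAD's actual checks; the distance computation, the halt-to-zero behavior, and all numeric values are assumptions.

```python
def intervene(v_target, alpha, boxes, inflate, ego, d_safe):
    """LEAD-style uncertainty interventions (simplified sketch):
    scale the speed target by (1 - alpha), inflate each hazard box by a
    margin, and halt if any inflated box comes within d_safe of the ego."""
    assert 0.0 <= alpha <= 0.3            # visibility-dependent slowdown
    v = v_target * (1.0 - alpha)          # v_target <- v_target * (1 - alpha)
    halt = False
    for (xmin, ymin, xmax, ymax) in boxes:
        xmin, ymin = xmin - inflate, ymin - inflate   # inflate box by margin
        xmax, ymax = xmax + inflate, ymax + inflate
        # point-to-box distance from the ego (0 if the ego is inside the box)
        dx = max(xmin - ego[0], 0.0, ego[0] - xmax)
        dy = max(ymin - ego[1], 0.0, ego[1] - ymax)
        if (dx * dx + dy * dy) ** 0.5 < d_safe:
            halt = True
    return (0.0 if halt else v), halt

v, halt = intervene(v_target=8.0, alpha=0.2, boxes=[(6, 0, 9, 2)],
                    inflate=1.0, ego=(0.0, 0.0), d_safe=6.0)
print(v, halt)   # 0.0 True  (hazard inside the safety margin forces a halt)
```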

2.3 Imitation-Only Training

  • All modifications exist solely in the expert’s rules; the student loss remains strictly imitation with no auxiliary regularizers.
  • Improved dataset alignment yields substantial gains without architectural regularization.

3. Enhanced Navigational Intent Modeling

TFv6 resolves goal specification ambiguities by replacing TFv5’s single-point-to-GRU path conditioning with a three-point route token strategy:

  • Token Formation: Previous, current, and next GNSS target points G = { p_{t−1}, p_t, p_{t+1} } are normalized to [−1, 1] and embedded:

e_i = W_2 · ReLU(W_1 · p_i) ∈ ℝ^D

  • Token Injection: Route tokens e_{t−1}, e_t, e_{t+1} are concatenated with the learned queries entering the decoder:

[Q; e_{t−1}; e_t; e_{t+1}]

  • Effect: The transformer’s decoder layers attend jointly over route and BEV features, enabling both immediate and multi-step path reasoning, which mitigates goal-fixation behaviors and improves higher-level maneuvering (e.g., smoother lane changes).
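The token formation and injection steps can be sketched directly from the formulas above. The normalization extent, hidden width, weight scales, and the example coordinates are assumed values for illustration only.

```python
import numpy as np

def embed_route_points(points, W1, W2, extent=64.0):
    """e_i = W2 · ReLU(W1 · p_i) after normalizing GNSS targets to [-1, 1]."""
    p = np.asarray(points, dtype=float) / extent      # assumed map extent
    h = np.maximum(p @ W1.T, 0.0)                     # ReLU(W1 · p_i)
    return h @ W2.T                                   # (3, D) route tokens

rng = np.random.default_rng(2)
D, H = 256, 128
W1 = rng.normal(size=(H, 2)) * 0.1
W2 = rng.normal(size=(D, H)) * 0.1
route = [(-12.0, 3.0), (0.0, 0.0), (15.0, -4.0)]      # p_{t-1}, p_t, p_{t+1}

e = embed_route_points(route, W1, W2)
Q = rng.normal(size=(8, D))                           # M = 8 learned queries
decoder_input = np.concatenate([Q, e], axis=0)        # [Q; e_{t-1}; e_t; e_{t+1}]
print(decoder_input.shape)   # (11, 256)
```

Because the route tokens sit in the same sequence as the queries, every decoder layer can attend over them jointly with the BEV features, which is what enables the multi-step path reasoning described above.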

4. Evaluation on Closed-Loop CARLA Benchmarks

TFv6 performance is evaluated under CARLA Leaderboard 2.0 standards using the Driving Score (DS), Success Rate (SR), and Normalized DS (NDS) as appropriate.

| Benchmark | TFv5 | TFv6 (w/ Radar) | Oracle/Expert (LEAD) |
|---|---|---|---|
| Bench2Drive (DS / SR) | 83.5 / 67.3% | 95.2 / 86.8% | 96.8 / 96.6% |
| Longest6 v2 (DS / RC) | 23.0 / 70% | 62.0 / 91% | 73.0 / 93% |
| Town13 (DS / NDS) | 1.08 / 2.12 | 2.65 / 4.04 | 36.3 / 58.5 |
  • Bench2Drive: a nearly 12-point DS increase and a nearly 20-percentage-point SR gain over TFv5.
  • Longest6 v2: TFv6 more than doubles prior SOTA DS, notably reducing the gap to the privileged expert.
  • Town13 (unseen): TFv6 doubles DS and NDS compared to TFv5, indicating robust zero-shot generalization.

5. Sim-to-Real Transfer Using Latent TransFuser v6

TFv6’s architecture, minus LiDAR/radar, forms the Latent TransFuser v6 (LTFv6) policy for real-world camera-only evaluation.

5.1 Joint Perception Supervision

  • Pre-training with LEAD (CARLA synthetic) plus real-world panoramas (NAVSIM/WOD).
  • Auxiliary heads for semantic segmentation and object detection introduce a composite perception loss:

L_perc = λ_seg · L_seg + λ_det · L_det
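A minimal sketch of the composite perception loss, assuming per-pixel cross-entropy for segmentation and a smooth-L1 regression term for detection; the specific loss forms, weights λ, and toy tensor shapes are assumptions, not the paper's exact choices.

```python
import numpy as np

def perception_loss(seg_logits, seg_labels, det_pred, det_gt,
                    lam_seg=1.0, lam_det=0.5):
    """L_perc = lam_seg * L_seg + lam_det * L_det (assumed loss forms).
    L_seg: per-pixel cross-entropy; L_det: smooth L1 on box regression."""
    # numerically stable softmax cross-entropy over the class axis
    z = seg_logits - seg_logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    L_seg = -np.mean(np.take_along_axis(logp, seg_labels[..., None], -1))
    # smooth L1 (Huber, delta = 1) on detection regression targets
    d = np.abs(det_pred - det_gt)
    L_det = np.mean(np.where(d < 1.0, 0.5 * d * d, d - 0.5))
    return lam_seg * L_seg + lam_det * L_det

rng = np.random.default_rng(3)
seg_logits = rng.normal(size=(4, 4, 5))              # toy 4x4 map, 5 classes
seg_labels = rng.integers(0, 5, size=(4, 4))
det_pred, det_gt = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
print(perception_loss(seg_logits, seg_labels, det_pred, det_gt) > 0)  # True
```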

5.2 Domain Adaptation and Curriculum

  • Curriculum: initial joint sim/real training (e.g., 30 epochs 50/50 mix), transitioning to real-only fine-tuning.
  • Photometric augmentation, sensor noise injection, and random weather variations are applied for domain robustness.
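The curriculum schedule can be illustrated as a batch sampler that draws a 50/50 sim/real mix before a switch epoch and real-only data afterwards. The switch epoch, mix ratio, and sampling scheme are taken from the example figures above and are otherwise assumed.

```python
import random

def sample_batch(sim_data, real_data, epoch, switch_epoch=30, sim_frac=0.5,
                 batch_size=4, rng=None):
    """Curriculum sketch: 50/50 sim/real mix for the first `switch_epoch`
    epochs, then real-only fine-tuning (schedule values are assumptions)."""
    rng = rng or random.Random(0)
    frac = sim_frac if epoch < switch_epoch else 0.0
    batch = []
    for _ in range(batch_size):
        pool = sim_data if rng.random() < frac else real_data
        batch.append(rng.choice(pool))
    return batch

sim = [("sim", i) for i in range(10)]
real = [("real", i) for i in range(10)]
early = sample_batch(sim, real, epoch=5)     # mixed sim/real phase
late = sample_batch(sim, real, epoch=40)     # real-only fine-tuning phase
print(all(src == "real" for src, _ in late))   # True
```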

5.3 Real-World Results

| Benchmark | LTF (base) | LTFv6 | + LEAD pre-train |
|---|---|---|---|
| NAVSIM v1 | 83.8 | 85.4 | 86.4 |
| NAVSIM v2 | 23.1 | 28.3 | 31.4 |
| WOD-E2E | — | 7.51 | 7.76 |

Consistent performance improvements are observed by switching from LTF to LTFv6 and further by leveraging LEAD pre-training. A plausible implication is that synthetic, student-aligned expert data substantially benefits real-world open-loop driving policy learning.

6. Significance and Outlook

TFv6 demonstrates that transformer-driven BEV fusion of cameras, LiDAR, and radar—when combined with expert data tailored to learner constraints—can yield near-oracle driving performance under simulated and real conditions. The architecture’s removal of train-test bottlenecks (e.g., late GRUs, weak intent conditioning) and the expert policy’s careful alignment to student observations set new benchmarks in robust, generalizable end-to-end driving. The approach provides a scalable, high-fidelity baseline for future research into transfer learning, long-horizon planning, and sensor fusion in autonomous vehicles (Nguyen et al., 23 Dec 2025).
