
Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving

Published 13 Jun 2025 in cs.CV, cs.LG, and cs.RO (arXiv:2506.12251v2)

Abstract: Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.

Summary

  • The paper introduces a triplane-based multi-camera tokenization method that reduces token count by up to 72% and inference latency by up to 50%.
  • It leverages 3D neural reconstruction and geometry-aware encodings to decouple tokens from image resolution and camera count for scalable AR Transformer deployment.
  • Experimental results on large-scale AV datasets demonstrate improved open- and closed-loop driving performance and robust scene reconstruction.

Introduction

The proliferation of foundation models within the robotics and autonomous driving domains has created demand for policies that generalize robustly across diverse, real-world conditions. Autoregressive (AR) Transformers have emerged as the backbone for such policies, but efficient sensor data representation remains a key obstacle for real-time deployment, particularly when multi-camera setups yield high-dimensional observations that scale linearly with both camera count and image resolution. Conventional patch-based tokenization strategies induce significant computational and memory overheads, constraining model size and context length, thus limiting real-world feasibility.

This paper introduces a triplane-based multi-camera tokenization framework tailored for AR Transformer-driven end-to-end (E2E) autonomous vehicle (AV) planning. Building on recent advances in 3D neural reconstruction, the proposed approach encodes arbitrary multi-camera images into a geometry-aware, resolution- and camera-count-agnostic volumetric latent representation (triplanes), which is then patchified into tokens suitable for transformer ingestion. The primary claims are: (1) significant reduction (up to 72%) in the number of tokens relative to patch-based baselines; (2) commensurate or improved driving performance in open- and closed-loop metrics; and (3) up to 50% decrease in end-to-end inference latency.

Methodology

The core of the framework is a feedforward pipeline that encodes N input camera images into three orthogonal, fixed-size triplane feature grids via an image encoder backbone (DINOv2-small or ResNet-50), followed by 3D-aware attention layers that leverage camera calibration for spatial alignment. Notably, the design explicitly decouples token count from both image resolution and the number of cameras.
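As a shape-level sketch of this stage, the key property is that the triplane output has a fixed size regardless of how many views are fed in, or at what resolution. The dimensions, plane resolution, and channel width below are illustrative assumptions, not the paper's exact configuration, and the encoder body is a stand-in:

```python
import numpy as np

# Illustrative sizes; the paper's exact dimensions are not given here.
PLANE_RES, C = 64, 32  # triplane spatial resolution and channel width

def encode_to_triplanes(images: np.ndarray) -> dict:
    """Stand-in for the image-encoder + 3D-aware attention stage.

    Maps an arbitrary number of camera views to three fixed-size
    orthogonal feature planes (xy, xz, yz). A real implementation
    would fuse DINOv2/ResNet features with calibration-aware
    cross-attention; here we only illustrate that the output shape
    is independent of both camera count and input resolution.
    """
    assert images.ndim == 4  # (num_cams, H, W, 3)
    return {ax: np.zeros((PLANE_RES, PLANE_RES, C)) for ax in ("xy", "xz", "yz")}

# Same output shape for 7 or 3 cameras, at different resolutions.
tri_a = encode_to_triplanes(np.zeros((7, 224, 448, 3)))
tri_b = encode_to_triplanes(np.zeros((3, 128, 256, 3)))
assert tri_a["xy"].shape == tri_b["xy"].shape == (64, 64, 32)
```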

The volumetric triplane representation admits scalable self-supervised training through photometric losses (LPIPS and ℓ1), casting image reconstruction as a volumetric rendering problem. This eliminates the brittleness of adversarial losses common to autoencoder-based tokenizers and ensures robustness across a wide variety of driving scenes.
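A minimal sketch of such a photometric objective, with an illustrative weight and the perceptual term left as a pluggable callable (e.g. from the `lpips` package); the paper's exact loss weights are not given here:

```python
import numpy as np

def l1_photometric(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute pixel error between a rendering and the target image."""
    return float(np.mean(np.abs(pred - target)))

def reconstruction_loss(pred, target, lpips_fn=None, w_lpips=0.5):
    """Self-supervised photometric objective: L = l1 + w * LPIPS.

    `lpips_fn` is any perceptual-distance callable; `w_lpips` is an
    illustrative weight, not the paper's value. No adversarial term
    is needed, which avoids GAN training instabilities.
    """
    loss = l1_photometric(pred, target)
    if lpips_fn is not None:
        loss += w_lpips * lpips_fn(pred, target)
    return loss

# A perfect reconstruction incurs zero loss.
assert reconstruction_loss(np.zeros((4, 4, 3)), np.zeros((4, 4, 3))) == 0.0
```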

Tokenization is realized by patchifying each triplane into a pre-specified grid of 1D vectors (continuous tokens) via a linear projection. The overall scheme is illustrated in (Figure 1).

(Figure 1)

Figure 1: The triplane encoder tokenizes arbitrarily many multi-camera views into geometry-aware patches decoupled from input resolution or camera count.
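The patchification step described above can be sketched as follows; the patch size, token width, and plane resolution are illustrative assumptions rather than the paper's reported configuration:

```python
import numpy as np

def patchify_plane(plane: np.ndarray, patch: int, proj: np.ndarray) -> np.ndarray:
    """Split one (R, R, C) feature plane into non-overlapping patches,
    flatten each, and linearly project to continuous token vectors."""
    R, _, C = plane.shape
    assert R % patch == 0
    g = R // patch  # patches per side
    # (R, R, C) -> (g, g, patch, patch, C) -> (g*g, patch*patch*C)
    patches = (plane.reshape(g, patch, g, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(g * g, patch * patch * C))
    return patches @ proj  # (num_tokens, d_model)

# Illustrative: a 64x64x32 plane, 16x16 patches, 256-dim tokens.
plane = np.random.randn(64, 64, 32)
proj = np.random.randn(16 * 16 * 32, 256)
tokens = patchify_plane(plane, patch=16, proj=proj)
assert tokens.shape == (16, 256)  # 4x4 grid of continuous tokens per plane
```

Because the grid size is a free parameter of this step, the patch size can be changed after training to trade tokens for fidelity, which is the adaptive patchification property discussed below.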

Unlike ViT- or VQGAN-based tokenization, the triplane approach supports adaptive patchification post-training, making it amenable to dynamic inference-time performance-tuning. The approach is also compatible with discrete vector quantization (FSQ), but empirical evaluations indicate that continuous tokens outperform discrete ones in downstream driving metrics.

Experimental Results

Comprehensive evaluation is conducted on a proprietary 20,000-hour AV dataset (1700+ cities, 25 countries, 7 cameras per vehicle), as well as the nuScenes and Waymo Open Dataset (WOD) planning benchmarks.

Reconstruction and Representation Quality: Triplanes demonstrate high-fidelity neural rendering, achieving 28.9 dB PSNR and 0.84 SSIM for 7-camera setups—comparable to, or exceeding, state-of-the-art neural rendering results (see also (Figure 2)).

(Figure 2)

Figure 2: Visualization of triplane-based scene encoding and reconstructions with semantic PCA feature maps.

Open-Loop Planning: With a 1B-parameter AR Transformer backbone, triplane tokenization (using as few as 45 tokens per image) matches or slightly outperforms strong baselines (VQGAN, DINOv2-small) on motion prediction metrics (minADE_6 at horizons of 1 s, 3 s, 5 s). Aggressive patchification further reduces token count without deteriorating performance.
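For reference, minADE_k measures the average displacement error of the best of k predicted trajectory modes against the ground-truth trajectory. A minimal implementation of the metric (not the paper's evaluation code):

```python
import numpy as np

def min_ade(pred_modes: np.ndarray, gt: np.ndarray) -> float:
    """minADE_k over k predicted trajectories.

    pred_modes: (k, T, 2) predicted waypoints for k modes.
    gt:         (T, 2) ground-truth waypoints.
    Returns the lowest mean Euclidean displacement across modes.
    """
    # (k, T) per-step displacement, averaged over the horizon per mode.
    per_mode = np.linalg.norm(pred_modes - gt[None], axis=-1).mean(axis=-1)
    return float(per_mode.min())

gt = np.zeros((10, 2))
preds = np.stack([np.ones((10, 2)), np.zeros((10, 2))])  # 2 modes
assert min_ade(preds, gt) == 0.0  # one mode matches exactly
```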

Closed-Loop Simulation: In challenging simulate-and-deploy scenarios, the triplane model attains a lower offroad rate (2.7% at 20s) compared to the DINOv2 baseline while benefiting from dramatically reduced inference time (up to 50% faster). Critically, triplane models exhibit improved policy robustness in obstacle-dense environments (see (Figure 3)), capturing subtle spatial cues forfeited by 2D patch-based representations.

(Figure 3)

Figure 3: Qualitative comparison: Triplane-based policy executes robust navigation around obstacles in closed-loop simulation, outperforming the DINOv2 patch-based model.

Inference Scaling and Resolution/Camera-Count Agnosticism: Profiling demonstrates that the triplane framework permits scaling context lengths, number of cameras, and image resolution without increasing token sequence lengths, thus enabling real-time inference with large (7B+) backbones under realistic AV deployment constraints (see (Figure 4)).

(Figure 4)

Figure 4: Inference latency scales sublinearly with both number of cameras and context frames for triplane tokenization; baseline methods exhibit prohibitive scaling.
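The scaling difference can be illustrated with simple token-count arithmetic; the patch size, grid size, and resolutions below are illustrative assumptions, and the exact figures in the paper may differ:

```python
def patch_tokens(num_cams: int, h: int, w: int, patch: int = 14) -> int:
    """Patch-based (ViT-style) tokenization: the token count grows
    linearly with camera count and with image area."""
    return num_cams * (h // patch) * (w // patch)

def triplane_tokens(grid: int = 4, planes: int = 3) -> int:
    """Triplane tokenization: a fixed token budget per timestep,
    independent of camera count and resolution (illustrative sizes)."""
    return planes * grid * grid

# 7 cameras at 224x448 with 14px patches vs. a fixed triplane budget.
assert patch_tokens(7, 224, 448) == 3584  # grows with cameras/resolution
assert triplane_tokens() == 48            # constant per timestep
```

Since Transformer attention cost grows quadratically with sequence length, holding token count constant is what makes longer context windows and larger backbones feasible in real time.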

Implications and Future Directions

The geometry-aware, camera- and resolution-agnostic nature of triplane tokenization addresses the unfavorable scaling of traditional visual tokenization regimes, in which token counts grow linearly with camera count and resolution and attention cost grows quadratically with token count. This unlocks practical deployment of AR Transformer-based policies with larger models, longer sensory histories, and robust generalization across heterogeneous multi-camera setups, all without retraining.

From a theoretical perspective, this work encourages a shift toward 3D-anchored perceptual abstractions in embodied AI, highlighting the importance of inductive biases aligned with the physical environment. Practically, the reduction in inference time and tokens translates to lower energy consumption, higher throughput, and increased safety in next-generation AV stacks.

Several forward-looking initiatives are apparent:

  • Incorporation of temporal modeling within the triplane representation to better capture dynamic scene evolution.
  • Exploration of compressive methods for further memory reduction (e.g., hierarchical triplane architectures, adaptive quantization).
  • Integration of auxiliary signals (depth, semantics) for additional supervision, as preliminary results demonstrate compatibility with multi-task objectives.
  • Extensive real-world hardware deployment and benchmarking, leveraging hardware-optimized runtimes.

Conclusion

The presented triplane-based tokenization framework constitutes a scalable, efficient, and geometrically consistent approach for multi-camera sensor data abstraction in end-to-end AV policy stacks. Empirical evidence supports substantial improvements in token efficiency, end-to-end inference time, and driving performance over leading alternatives, without sacrificing generality or extensibility. The paradigm is positioned to facilitate the next stage of AR Transformer adoption in resource-constrained, safety-critical autonomous systems.
