- The paper presents an efficient single-layer transformer approach for BEV view transformation that compresses image features to width-focused embeddings.
- It introduces Reference Positional Encoding (RefPE) to accurately encode rotation and distance, enhancing spatial integrity in 3D detections.
- Evaluations on the nuScenes dataset demonstrate improved mAP and latency performance, underscoring its potential in autonomous driving.
Introduction
The paper "WidthFormer: Toward Efficient Transformer-based BEV View Transformation" (2401.03836) introduces an innovative approach for Bird's-Eye-View (BEV) transformation in the context of real-time 3D detection for autonomous vehicles. Traditional methods for BEV transformation typically fall into attention-based approaches or Lift-Splat frameworks, often burdened by computational intensity and deployment complexity. WidthFormer aims to overcome these limitations through a single-layer transformer architecture together with a novel positional encoding strategy.
Key Contributions
The WidthFormer method centers on reducing computational overhead by compressing image features into width-focused embeddings rather than attending over full height-and-width feature maps, sharply cutting the number of tokens it processes. With a single-layer transformer decoder and Reference Positional Encoding (RefPE), WidthFormer maintains high performance while reducing processing load and avoiding the non-standard operations that complicate deployment.
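To make the savings concrete, here is a minimal PyTorch sketch (with hypothetical feature-map shapes) of the width compression: mean pooling over the height axis turns each H×W feature map into W column tokens, so the decoder's attention runs over N·W tokens instead of N·H·W.

```python
import torch

# Hypothetical shapes: 6 camera views, 256 channels, 16x44 feature maps.
N, C, H, W = 6, 256, 16, 44
img_feats = torch.randn(N, C, H, W)

# Width-focused compression: mean-pool away the height axis.
width_feats = img_feats.mean(dim=2)   # (N, C, W)

# Key/value tokens a BEV decoder must attend over:
full_tokens = N * H * W               # 4224 with uncompressed feature maps
width_tokens = N * W                  # 264 after compression, a 16x reduction
print(full_tokens, width_tokens)
```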
Reference Positional Encoding (RefPE)
RefPE is a critical advance in this paper: it preserves 3D geometric integrity in the image embeddings by encoding rotation and distance information specific to each BEV query. By aggregating directional and spatial cues into a single refined encoding step, the authors report significant improvements in spatial accuracy.
Figure 1: Reference Positional Encoding (RefPE) combines rotation and distance components, and can also enhance sparse 3D detectors.
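A hedged sketch of the underlying idea: a reference point's position can be expressed in polar form (rotation around the ego vehicle plus radial distance) and embedded with standard sinusoidal encodings. The function names, dimensions, and coordinate range below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def sinusoidal(x: torch.Tensor, dim: int, temperature: float = 10000.0) -> torch.Tensor:
    """Standard 1D sinusoidal embedding of scalar values x -> (..., dim)."""
    freqs = temperature ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = x.unsqueeze(-1) / freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

def ref_positional_encoding(ref_xy: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Encode each reference point by its rotation (azimuth) and distance.

    ref_xy: (num_queries, 2) BEV reference points in ego coordinates.
    Returns: (num_queries, 2 * dim) rotation-and-distance embedding.
    """
    rotation = torch.atan2(ref_xy[:, 1], ref_xy[:, 0])   # bearing around the ego vehicle
    distance = ref_xy.norm(dim=-1)                       # radial distance from the ego vehicle
    return torch.cat([sinusoidal(rotation, dim), sinusoidal(distance, dim)], dim=-1)

queries = torch.rand(200, 2) * 102.4 - 51.2   # hypothetical BEV points in a +-51.2 m range
pe = ref_positional_encoding(queries)         # (200, 256)
```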
Robustness and Application to Autonomous Driving
The authors evaluate WidthFormer alongside existing view transformation (VT) methods such as BEVFormer under varying degrees of 6DoF (six-degrees-of-freedom) camera pose perturbation. WidthFormer shows a strong balance of real-time efficiency and robustness against the disturbances typical of dynamic driving scenarios. Evaluation on the nuScenes dataset demonstrates that WidthFormer not only achieves a significant performance edge over existing methods but does so with improved latency characteristics.
Figure 2: Robustness comparison of different VT methods under 6DoF camera perturbations.
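To illustrate the kind of perturbation involved, here is a small NumPy sketch that jitters a camera extrinsic matrix across all six degrees of freedom (three rotational, three translational); the noise magnitudes are hypothetical, not the paper's test settings.

```python
import numpy as np

def perturb_extrinsic(T, rot_std_deg=1.0, trans_std_m=0.1, rng=None):
    """Apply a random 6DoF perturbation to a 4x4 camera extrinsic matrix:
    three rotational DoF (roll, pitch, yaw) and three translational DoF."""
    rng = rng if rng is not None else np.random.default_rng()
    roll, pitch, yaw = np.deg2rad(rng.normal(0.0, rot_std_deg, size=3))
    cx, sx = np.cos(roll), np.sin(roll)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cz, sz = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    delta = np.eye(4)
    delta[:3, :3] = Rz @ Ry @ Rx                          # rotational noise
    delta[:3, 3] = rng.normal(0.0, trans_std_m, size=3)   # translational noise
    return delta @ T

T_cam = np.eye(4)                     # placeholder camera-to-ego extrinsic
T_perturbed = perturb_extrinsic(T_cam, rot_std_deg=2.0)
```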
Methodology
WidthFormer's architecture performs the image-to-BEV transformation with a single transformer decoder layer. This decoder attends to the compressed, width-focused features obtained by pooling the height dimension of the image features, cutting the computation typical of BEV transformations while keeping information loss in check.
Figure 3: WidthFormer takes multi-view images and transforms them to BEV via a transformer designed around the compressed feature dimensions.
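A minimal sketch of such a single-layer decoder, assuming learned BEV queries that cross-attend to the concatenated width tokens from all cameras; the class name, BEV grid size, and the point where the positional encoding is injected are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleLayerBEVDecoder(nn.Module):
    """Minimal single-layer decoder: BEV queries cross-attend to width features."""
    def __init__(self, dim: int = 256, heads: int = 8, bev_size: int = 128):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, width_feats: torch.Tensor, pos_enc: torch.Tensor) -> torch.Tensor:
        # width_feats: (B, N*W, C) compressed tokens from all cameras
        q = self.bev_queries.unsqueeze(0).expand(width_feats.size(0), -1, -1)
        kv = width_feats + pos_enc                 # inject the positional encoding
        x, _ = self.cross_attn(q, kv, kv)
        x = self.norm1(q + x)
        x = self.norm2(x + self.ffn(x))
        return x                                   # (B, bev_size^2, C) BEV features

decoder = SingleLayerBEVDecoder()
tokens = torch.randn(1, 6 * 44, 256)               # 6 cameras x 44 width tokens
bev = decoder(tokens, torch.randn_like(tokens))    # (1, 16384, 256)
```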
Feature Compressing and Compensation Techniques
WidthFormer compresses image features into width-wise tokens by pooling over the height axis. To mitigate the information loss usually associated with such compression, a Refine Transformer module is applied, which refines the compressed features by retrieving additional contextual information from the image features.
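One plausible way to realize this compensation, sketched below: each pooled width token cross-attends back to the full-height image column it was pooled from, recovering detail that the mean discarded. The column-wise attention pattern here is an illustrative assumption, not necessarily the paper's exact Refine Transformer design.

```python
import torch
import torch.nn as nn

class RefineWidthFeatures(nn.Module):
    """Sketch: width tokens re-attend to full-height image columns to
    recover context lost by height pooling (illustrative design)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W) per-camera feature map
        B, C, H, W = img_feats.shape
        width = img_feats.mean(dim=2)                            # (B, C, W) pooled
        # Treat each of the W columns as its own attention problem:
        q = width.permute(0, 2, 1).reshape(B * W, 1, C)          # one query per column
        kv = img_feats.permute(0, 3, 2, 1).reshape(B * W, H, C)  # full-height column
        refined, _ = self.attn(q, kv, kv)
        out = self.norm(q + refined).reshape(B, W, C)
        return out.permute(0, 2, 1)                              # (B, C, W)

refine = RefineWidthFeatures()
feats = torch.randn(6, 256, 16, 44)      # 6 camera views
width_feats = refine(feats)              # (6, 256, 44), with compensated context
```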
Experimental Benchmark
Experiments on the nuScenes dataset underscore WidthFormer's benefits: gains in mean Average Precision (mAP) together with reduced latency, including on computationally constrained edge hardware. The method achieves notable mAP improvements while also lifting NDS (nuScenes Detection Score), a key composite metric of detection quality.
Conclusion
WidthFormer marks a significant step forward in real-time BEV view transformation, showing how an efficient transformer structure and a well-designed positional encoding can combine to make transformer-based view transformation practical for autonomous driving. Future research could build on WidthFormer by extending its efficiency gains to broader perception tasks.
Overall, WidthFormer makes a substantive contribution to the field of real-time 3D object detection by delivering a streamlined and effective methodology to process multi-view images into coherent BEV representations with lower latency and enhanced robustness.