- The paper introduces Availability-aware Sensor Fusion (ASF), a new method for autonomous driving that improves robustness against sensor degradation by fusing features in a unified canonical space.
- ASF utilizes a unified canonical projection (UCP) for consistent feature representation across sensors, cross-attention across sensors along patches with post-feature normalization (CASAP-PN) for availability awareness, and a sensor combination loss (SCL) for training robustness.
- Evaluated on the K-Radar dataset, ASF achieves significant improvements in object detection performance (9.7% AP_BEV, 20.1% AP_3D) and demonstrates superior robustness under adverse conditions and sensor degradation.
The paper introduces an availability-aware sensor fusion (ASF) method designed for autonomous driving applications, addressing the challenges of sensor degradation and failure by leveraging a unified canonical space. The key contributions of this work are the ASF framework, which combines unified canonical projection (UCP) and cross-attention across sensors along patches with post-feature normalization (CASAP-PN), and a sensor combination loss (SCL) function to optimize detection performance across various sensor combinations. The ASF is validated on the K-Radar dataset, demonstrating superior object detection performance compared to existing state-of-the-art fusion methods, especially under adverse weather conditions and sensor degradation scenarios.
The paper highlights the limitations of existing multi-modal sensor fusion methods, categorizing them into deeply coupled fusion (DCF) and sensor-wise cross-attention fusion (SCF). DCF methods directly combine feature maps extracted from sensor-specific encoders but are vulnerable to sensor degradation and require retraining when the number of sensors changes. SCF methods use cross-attention to selectively combine available patches, but they lack sensor scalability and incur high computational costs that scale quadratically with the number of patches. The fundamental limitation of these methods is the inconsistency in feature representation across different sensors, such as cameras, LiDAR, and 4D Radar.
To address these limitations, the ASF method projects features from each sensor into a unified canonical space. The UCP sub-module projects features from each sensor into a unified space based on common criteria, eliminating inconsistencies in sensor features. The CASAP-PN sub-module estimates the availability of sensors through patch-wise cross-attention on the unified features, assigning higher weights to features from available sensors and lower weights to those from missing or uncertain sensors. This approach reduces computational complexity from $O(N_s^2 N_p^2)$ in SCF to $O(N_s^2 N_p)$, where $N_s$ is the number of sensors and $N_p$ is the number of patches. The post-feature normalization ensures consistent processing of features by the detection head, regardless of the sensor combination.
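The source of the complexity reduction is that attention is computed over sensors at each patch rather than over all patch pairs. A minimal sketch of the cost comparison (the sensor and patch counts are illustrative, not from the paper):

```python
# Hypothetical illustration of the attention-cost reduction claimed for
# CASAP-PN versus sensor-wise cross-attention (SCF).

def scf_cost(n_sensors: int, n_patches: int) -> int:
    # SCF attends across all patch pairs of all sensor pairs: O(N_s^2 N_p^2)
    return n_sensors ** 2 * n_patches ** 2

def casap_pn_cost(n_sensors: int, n_patches: int) -> int:
    # CASAP-PN attends across sensors independently at each patch: O(N_s^2 N_p)
    return n_sensors ** 2 * n_patches

n_s, n_p = 3, 1024  # e.g. camera + LiDAR + 4D Radar, a 32x32 grid of patches
print(scf_cost(n_s, n_p) // casap_pn_cost(n_s, n_p))  # speedup factor = N_p
```

The ratio equals $N_p$, so the savings grow with BEV resolution.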
The SCL is introduced to integrate availability awareness into the detection head, optimizing learning outcomes across all possible sensor combinations. This loss function considers individual sensor unavailability during training, enabling the system to maintain high performance in the presence of unexpected sensor failure or adverse weather conditions.
Experiments on the K-Radar dataset demonstrate that ASF achieves improvements of 9.7% in AP_BEV (reaching 87.2%) and 20.1% in AP_3D (reaching 73.6%) for object detection at IoU=0.5, compared to state-of-the-art fusion methods. The method also exhibits robustness against sensor degradation and failure, dynamically adapting to different sensor combinations without retraining. Ablation studies validate the impact of key components such as patch size, channel dimension, the number of patches multiplier, the number of attention heads, and the SCL. Qualitative results show that ASF redistributes attention toward more reliable sensors in adverse weather conditions, maintaining near-optimal performance even with damaged sensors.
The sensor fusion framework consists of three stages: sensor-specific encoders, the ASF network, and a detection head. The paper adopts BEVDepth, SECOND, and RTNH backbones for camera, LiDAR, and 4D Radar, respectively, along with an SSD detection head. The sensor-specific encoders extract bird's-eye-view (BEV) feature maps (FMs) from each sensor's data.
The unified canonical projection (UCP) transforms sensor-specific patches into patches in a unified canonical space. The same-sized feature maps of each sensor are represented as $\mathbf{FM}_s \in \mathbb{R}^{C_s \times H \times W}$, where $C_s$ denotes the channel dimension for sensor $s$, and $H$ and $W$ represent the height and width of the BEV FMs for all sensors. Each FM is divided into patches $\mathbf{F}_p^s$, and the UCP operation $U_s(\cdot)$ transforms these patches into a unified canonical space, resulting in patches $\mathbf{P}_u^s$.
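As a rough sketch of the UCP idea, each sensor's BEV feature map can be split into patches and mapped to a shared channel dimension $C_u$ by a sensor-specific projection. The linear projection and all sizes below are illustrative assumptions; the paper's actual projection may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_patches(fm: np.ndarray, p: int) -> np.ndarray:
    """Split a (C, H, W) BEV feature map into (N_p, p*p*C) patch vectors."""
    c, h, w = fm.shape
    fm = fm.reshape(c, h // p, p, w // p, p)   # (C, H/p, p, W/p, p)
    fm = fm.transpose(1, 3, 2, 4, 0)           # (H/p, W/p, p, p, C)
    return fm.reshape(-1, p * p * c)           # (N_p, p*p*C)

def ucp(fm: np.ndarray, proj: np.ndarray, p: int) -> np.ndarray:
    """Project one sensor's patches into the unified canonical space."""
    return to_patches(fm, p) @ proj            # (N_p, C_u)

c_s, h, w, p, c_u = 8, 16, 16, 4, 32           # illustrative sizes
fm_lidar = rng.standard_normal((c_s, h, w))
w_lidar = rng.standard_normal((p * p * c_s, c_u))  # sensor-specific weights
patches_u = ucp(fm_lidar, w_lidar, p)
print(patches_u.shape)  # (16, 32): N_p = (16/4)*(16/4) patches, C_u channels
```

Because every sensor lands in the same $(N_p, C_u)$ space, the downstream fusion no longer depends on sensor-specific channel dimensions.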
Cross-attention across sensors along patches with post-feature normalization (CASAP-PN) uses the UCP-projected patches as keys ($K$) and values ($V$) for a trainable reference query $\mathbf{Q}_{\mathrm{ref}} \in \mathbb{R}^{N_q \times C_u}$, where $N_q$ is the number of queries. Cross-attention is performed across sensors along patches as:

$$\mathbf{Q}'_{\mathrm{ref},i} = \mathrm{CrossAttn}\left(Q = \mathbf{Q}_{\mathrm{ref}},\; K, V \in \{\mathbf{F}_{u,i}^{SC}, \mathbf{F}_{u,i}^{SL}, \mathbf{F}_{u,i}^{SR}\}\right), \quad i = 1{:}N_p.$$
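The per-patch attention over sensors can be sketched as follows. This is a simplified single-head NumPy version with illustrative names and sizes; the paper's multi-head attention and learned key/value projections are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def casap(q_ref: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Cross-attention across sensors at each patch.

    q_ref: (N_q, C_u) trainable reference queries, shared over patches.
    feats: (N_p, N_s, C_u) unified patches, one row of sensors per patch.
    Returns (N_p, N_q, C_u); the softmax over the sensor axis acts as a
    per-patch availability weight.
    """
    c_u = q_ref.shape[-1]
    # scores: (N_p, N_q, N_s). Attention runs over sensors, not patches,
    # so the cost grows linearly in N_p instead of quadratically.
    scores = np.einsum('qc,psc->pqs', q_ref, feats) / np.sqrt(c_u)
    attn = softmax(scores, axis=-1)
    return np.einsum('pqs,psc->pqc', attn, feats)

n_p, n_s, n_q, c_u = 16, 3, 1, 32              # camera, LiDAR, 4D Radar
fused = casap(rng.standard_normal((n_q, c_u)),
              rng.standard_normal((n_p, n_s, c_u)))
print(fused.shape)  # (16, 1, 32)
```

A degraded sensor whose unified features correlate poorly with the reference query receives a small softmax weight, which is how availability awareness emerges from the attention itself.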
The post-feature normalization (PN) $\mathcal{N}$ is applied to ensure that features can be processed consistently by the detection head regardless of the sensor combination. The fused FM $\mathbf{FM}_{\mathrm{fused}}$ is obtained by applying a reshape operation $T_n(\cdot)$ to the set of patches $\mathbf{P}_n$ with PN.
The sensor combination loss (SCL) is formalized as $\mathcal{L}_{SCL} = \sum_{s \in \mathcal{SC}} (\mathcal{L}_{cls,s} + \mathcal{L}_{reg,s})$, where $\mathcal{SC}$ represents the set of possible sensor combinations, and $\mathcal{L}_{cls,s}$ and $\mathcal{L}_{reg,s}$ are the classification and regression losses for each sensor combination.
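The structure of this sum can be sketched as follows. `det_loss` is a hypothetical stand-in for $\mathcal{L}_{cls,s} + \mathcal{L}_{reg,s}$; in real training, each term would come from a forward pass with the corresponding sensors masked out:

```python
from itertools import combinations

SENSORS = ("camera", "lidar", "radar")

def sensor_combinations(sensors):
    """All non-empty subsets of the sensor set (the set SC)."""
    return [c for r in range(1, len(sensors) + 1)
            for c in combinations(sensors, r)]

def scl(det_loss, sensors=SENSORS):
    # L_SCL = sum over combinations s in SC of (L_cls,s + L_reg,s)
    return sum(det_loss(c) for c in sensor_combinations(sensors))

print(len(sensor_combinations(SENSORS)))  # 7 non-empty combinations
```

With three sensors there are $2^3 - 1 = 7$ combinations, so each training step supervises the detector under every possible availability pattern.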
Despite these contributions, the capability of the camera network remains a limitation. Enhancing the camera backbone could further boost system performance, especially in favorable weather conditions.