Multimodal Object Detection via Probabilistic Ensembling

Published 7 Apr 2021 in cs.CV | (2104.02904v3)

Abstract: Object detection with multimodal inputs can improve many safety-critical systems such as autonomous vehicles (AVs). Motivated by AVs that operate in both day and night, we study multimodal object detection with RGB and thermal cameras, since the latter provides much stronger object signatures under poor illumination. We explore strategies for fusing information from different modalities. Our key contribution is a probabilistic ensembling technique, ProbEn, a simple non-learned method that fuses together detections from multi-modalities. We derive ProbEn from Bayes' rule and first principles that assume conditional independence across modalities. Through probabilistic marginalization, ProbEn elegantly handles missing modalities when detectors do not fire on the same object. Importantly, ProbEn also notably improves multimodal detection even when the conditional independence assumption does not hold, e.g., fusing outputs from other fusion methods (both off-the-shelf and trained in-house). We validate ProbEn on two benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal images, showing that ProbEn outperforms prior work by more than 13% in relative performance!

Summary

Multimodal object detection is an important component of safety-critical systems such as autonomous vehicles (AVs), which must perform reliably across varying conditions, including day and night. The paper "Multimodal Object Detection via Probabilistic Ensembling" by Chen et al. studies detection with RGB and thermal cameras, since thermal imagery provides much stronger object signatures under poor illumination. Its key contribution, ProbEn, is a probabilistic ensembling technique derived from Bayes' rule that fuses the detections produced for each modality.

Summary of Main Contributions

ProbEn is a non-learned ensembling strategy derived from Bayes' rule and first principles. It assumes that the modalities are conditionally independent given the object label and, under that assumption, fuses the detections from each modality into a single posterior. This makes detection more robust in conditions where the RGB or thermal modality alone may fail, such as poor illumination. ProbEn also handles missing modalities, where detectors on different data streams do not fire on the same object, through probabilistic marginalization.
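The fusion rule described above can be sketched in a few lines. Under conditional independence, Bayes' rule gives p(y | x1, x2) ∝ p(y | x1) p(y | x2) / p(y), where x1 and x2 are the two modalities and y is the class label. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `fuse_class_posteriors`, the uniform default prior, and the use of `None` to mark a modality that did not fire are all hypothetical choices, and matching detections across modalities is assumed to have been done already.

```python
import numpy as np

def fuse_class_posteriors(posteriors, prior=None, eps=1e-12):
    """Fuse per-modality class posteriors for one matched object.

    Assumes conditional independence across modalities, so that
    p(y | x1..xM) is proportional to prod_i p(y | x_i) / p(y)^(M-1).

    posteriors: list with one entry per modality; each entry is a 1-D
        sequence of class probabilities, or None if that modality's
        detector did not fire on the object (hypothetical convention).
    prior: class prior p(y); defaults to uniform (an assumption).
    """
    # Marginalizing over a missing modality amounts to dropping its term.
    available = [np.asarray(p, dtype=float) for p in posteriors if p is not None]
    if not available:
        raise ValueError("at least one modality must provide a detection")

    num_classes = available[0].shape[0]
    if prior is None:
        prior = np.full(num_classes, 1.0 / num_classes)
    prior = np.asarray(prior, dtype=float)

    # Work in log space for numerical stability.
    log_fused = sum(np.log(p + eps) for p in available)
    log_fused = log_fused - (len(available) - 1) * np.log(prior + eps)

    # Normalize so the fused scores sum to one.
    log_fused = log_fused - log_fused.max()
    fused = np.exp(log_fused)
    return fused / fused.sum()

# Illustrative scores only: [object, background] for RGB and thermal.
rgb = [0.7, 0.3]
thermal = [0.8, 0.2]
print(fuse_class_posteriors([rgb, thermal]))  # higher confidence than either input
print(fuse_class_posteriors([rgb, None]))     # thermal missing: RGB posterior unchanged
```

With a uniform prior, two agreeing detections reinforce each other (0.7 and 0.8 fuse to roughly 0.90), while a detection seen by only one modality keeps its original score rather than being penalized.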

ProbEn is validated on two benchmarks with different alignment characteristics: KAIST (aligned multimodal images) and FLIR (unaligned multimodal images). The authors report that ProbEn outperforms prior work by more than 13% in relative performance. ProbEn also improves detection when the conditional independence assumption does not hold, for example when fusing the outputs of other fusion methods, both off-the-shelf and trained in-house.
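To make "relative performance" concrete, the following minimal sketch shows how a relative improvement differs from an absolute one. The helper `relative_improvement` and every number in it are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_improvement(baseline, new, lower_is_better=False):
    """Return the improvement of `new` over `baseline` as a fraction of `baseline`.

    lower_is_better: set True for error-style metrics (for example a miss
        rate), False for accuracy-style metrics (for example average precision).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - new) if lower_is_better else (new - baseline)
    return change / baseline

# Hypothetical accuracy-style scores: 50.0 -> 56.5 is +6.5 absolute, 13% relative.
print(relative_improvement(50.0, 56.5))

# Hypothetical error-style scores: 10.0 -> 8.7 is -1.3 absolute, 13% relative.
print(relative_improvement(10.0, 8.7, lower_is_better=True))
```

The same 13% relative gain can therefore correspond to very different absolute gaps, depending on the baseline value and on whether the metric rewards higher or lower numbers.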

Implications

The work has both theoretical and practical implications. Theoretically, ProbEn treats multimodal fusion as probabilistic ensembling of detector outputs, which invites further work on non-learned probabilistic approaches to combining data from different sources. Practically, more reliable detection, most notably at night, matters for the operational safety of AV systems. Because ProbEn handles both aligned and unaligned image pairs, it may also apply to systems beyond AVs that need robust multimodal integration.

Prospects and Future Developments

While the work sets a strong precedent for detector ensembling as a method for multimodal fusion, it also opens avenues for future research to explore the integration of additional modalities beyond RGB and thermal. The potential for incorporating LiDAR, radar, and other sensor data could further enrich the information available for object detection systems. Additionally, exploring learned variations of probabilistic ensembling and extending the theory to account for correlated data streams could offer improvements in performance and applicability across different domains.

The open-source nature of the implementation provides further potential for collaborative advancement, allowing the fusion approach to be tested, scrutinized, and potentially improved by the broader research community. This continuous refinement and validation against real-world data will be crucial to ensuring the robustness of multimodal detection systems as they become more prevalent and essential in various applications.

In conclusion, the work by Chen et al. contributes ProbEn, a simple, non-learned fusion method for multimodal detection that is derived from Bayes' rule and offers practical advantages for safety-critical systems.