A Multimodal Hybrid Late-Cascade Fusion Network for Enhanced 3D Object Detection
The paper introduces a multimodal 3D object detection system that combines LiDAR and RGB cameras through a hybrid late-cascade fusion network. The approach targets a core requirement of autonomous vehicles: accurately detecting pedestrians, cyclists, and cars at varying distances and in complex environments. By fusing the two modalities, it leverages the strengths of each sensor while compensating for the weaknesses of the other.
Methodology Overview
The proposed system combines two fusion strategies: late fusion and cascade fusion. Late fusion mitigates LiDAR false positives by associating LiDAR-based detections with detections from the RGB cameras: LiDAR 3D bounding boxes are projected onto the RGB images and matched to 2D detections by maximizing Intersection over Union (IoU). Cascade fusion, in turn, is used to recover LiDAR false negatives: RGB detections from the stereo camera views, constrained by epipolar geometry, identify objects that the LiDAR branch missed.
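The projection-and-matching step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the 0.5 IoU threshold, and the use of the Hungarian algorithm (via `scipy.optimize.linear_sum_assignment`) to maximize total IoU are all assumptions made for the sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def project_box_to_image(corners_3d, P):
    """Project the 8 corners of a 3D box (camera coordinates, shape (8, 3))
    through a 3x4 projection matrix P and return the tight axis-aligned
    2D box [x1, y1, x2, y2] enclosing them."""
    homo = np.hstack([corners_3d, np.ones((8, 1))])   # (8, 4) homogeneous
    uvw = homo @ P.T                                  # (8, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                     # perspective divide
    return np.array([uv[:, 0].min(), uv[:, 1].min(),
                     uv[:, 0].max(), uv[:, 1].max()])

def iou_2d(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(lidar_boxes_2d, rgb_boxes_2d, iou_thresh=0.5):
    """Assign projected LiDAR boxes to RGB boxes by maximizing total IoU
    (Hungarian algorithm). Returns matched index pairs, plus indices of
    unmatched LiDAR boxes (false-positive candidates to filter) and
    unmatched RGB boxes (candidates for detection recovery)."""
    iou = np.array([[iou_2d(l, r) for r in rgb_boxes_2d]
                    for l in lidar_boxes_2d])
    rows, cols = linear_sum_assignment(-iou)  # negate to maximize IoU
    matches = [(i, j) for i, j in zip(rows, cols) if iou[i, j] >= iou_thresh]
    unmatched_lidar = set(range(len(lidar_boxes_2d))) - {i for i, _ in matches}
    unmatched_rgb = set(range(len(rgb_boxes_2d))) - {j for _, j in matches}
    return matches, sorted(unmatched_lidar), sorted(unmatched_rgb)
```

Low-IoU pairs are rejected even if assigned, so a LiDAR box with no plausible RGB counterpart ends up in `unmatched_lidar` and can be treated as a false-positive candidate.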
The system is composed of several components:
- RGB and LiDAR branches provide initial 3D and 2D object detections.
- Bounding Box Matching module maximizes the IoU between projected LiDAR detections and RGB detections, filtering out false positives.
- Detection Recovery module leverages unmatched RGB detections to form frustum proposals, which are processed by a Frustum Localizer to generate 3D bounding boxes for missed objects.
- Semantic Fusion module validates the semantic consistency between modalities, primarily relying on RGB data for accurate labeling.
Experimental Results
Evaluations on the KITTI object detection benchmark show substantial improvements, especially for challenging classes such as pedestrians and cyclists. The proposed system achieves notable gains in 3D Average Precision (AP) and bird's-eye-view (BEV) AP across multiple difficulty levels compared to single-modal baselines and existing multimodal detectors, underscoring its efficacy in reducing false negatives and improving recall.
Implications and Future Work
The integration of cascade and late fusion mechanisms provides flexibility and achieves high recall for small and distant objects, which are typically challenging for LiDAR detectors alone. The modularity of the approach allows adaptable training: pre-trained single-modal models can be reused without requiring large-scale multimodal datasets. The relatively low computational overhead of the proposed modules also makes the approach suitable for practical applications where real-time processing is essential.
Future developments could focus on three areas:
1. Exploring more efficient deep learning architectures to further reduce runtime latency without compromising detection accuracy.
2. Enhancing data sharing across sensor modalities to increase robustness in diverse environmental conditions.
3. Expanding research to datasets with varied sensor setups to validate the generalizability of the proposed fusion strategy.
The multimodal hybrid late-cascade fusion system represents a significant advance for autonomous vehicle technology, and its utility could extend to broader applications requiring precise, reliable real-time object detection.
Conclusion
This paper presents a comprehensive and effective approach to leveraging multimodal sensory data for enhanced 3D object detection. The hybrid fusion strategy not only improves detection accuracy but also maintains high computational efficiency, presenting a viable solution for complex and dynamic settings such as autonomous driving.