
Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning

Published 14 Mar 2025 in cs.CV | (2503.11780v2)

Abstract: Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.

Summary

  • The paper identifies fusion degradation in multi-modal object detection and proposes the M²D-LIF framework to enhance mono-modality feature learning.
  • It employs a teacher-student distillation approach alongside a brightness-aware fusion mechanism to optimize feature extraction and fusion.
  • Experiments on DroneVehicle, FLIR, and LLVIP datasets demonstrate state-of-the-art mAP improvements with a low parameter count.

Rethinking Multi-Modal Object Detection

This paper (2503.11780) addresses the issue of insufficient mono-modality feature learning in multi-modal object detection (MMOD), which leads to a phenomenon called "Fusion Degradation." The authors introduce a novel framework, M²D-LIF, comprising Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF), to enhance mono-modality learning and achieve superior object detection performance.

Identifying Fusion Degradation

The authors identify a significant problem in MMOD: the "Fusion Degradation" phenomenon. This occurs when objects detectable by a mono-modal detector are missed by a multi-modal detector (Figure 1).

Figure 1: An illustration of the Fusion Degradation phenomenon, showing missed detections by multi-modal methods compared to mono-modal methods, along with statistics of its prevalence.

To investigate the underlying causes, the paper employs a linear probing evaluation. Mono-modal and multi-modal object detectors are trained, and their backbones are evaluated by freezing them and training new detection heads. The results indicate that multi-modal joint training leads to insufficient learning of each modality, which limits the overall detection performance (Figure 2).

Figure 2: Linear probing evaluation on the FLIR dataset, demonstrating the performance of different feature fusion methods.
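As a concrete illustration of the probing protocol, the sketch below freezes a trained backbone and optimizes only a freshly initialized head. The architectures, data, and loss are simple stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in backbone and detection head; the paper's detector is not specified here.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
)
head = nn.Conv2d(128, 5, 1)  # e.g. 4 box offsets + 1 objectness score per location

# Linear probing: freeze every backbone parameter so only the new head learns.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

optimizer = torch.optim.SGD(head.parameters(), lr=1e-2, momentum=0.9)

images = torch.randn(2, 3, 64, 64)    # dummy batch standing in for RGB or IR inputs
targets = torch.randn(2, 5, 32, 32)   # dummy dense targets, for illustration only

with torch.no_grad():                 # features come from the frozen backbone
    feats = backbone(images)
loss = F.mse_loss(head(feats), targets)  # placeholder for a real detection loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The probed head's accuracy then serves as a proxy for how much each backbone actually learned during joint training.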

The M²D-LIF Framework

To mitigate the Fusion Degradation phenomenon, the authors propose the M²D-LIF framework, which facilitates sufficient learning of mono-modality features during multi-modal joint training and employs a lightweight feature fusion approach. It consists of two main components: Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF) (Figure 3).

Figure 3: An overview of the M²D-LIF framework, highlighting the Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF) components.

Mono-Modality Distillation (M²D)

M²D enhances feature extraction using a teacher-student approach: a pre-trained mono-modal encoder distills knowledge into the multi-modal backbone network. The M²D method incorporates inner-modality and cross-modality distillation losses to optimize the framework during training. The inner-modality distillation loss $\mathcal{L}_{\text{IM}}$ aligns the multi-modal backbone with the feature responses of the teacher model:

$\mathcal{L}_{\text{IM}} = \text{D}(f_V, \widetilde{f}_V) + \text{D}(f_I, \widetilde{f}_I)$

where $\text{D}(\cdot,\cdot)$ denotes a distillation distance, $f_V$ and $f_I$ are the outputs of the student backbones, and $\widetilde{f}_V$ and $\widetilde{f}_I$ are the outputs of the teacher backbones.
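A minimal PyTorch sketch of this loss, with mean-squared error standing in for the generic distance $\text{D}(\cdot,\cdot)$, which the summary leaves unspecified:

```python
import torch
import torch.nn.functional as F

def inner_modality_loss(f_v, f_i, f_v_t, f_i_t):
    """L_IM: align each student backbone's features with its frozen teacher.
    MSE is an assumed stand-in for the generic distillation distance D."""
    return F.mse_loss(f_v, f_v_t) + F.mse_loss(f_i, f_i_t)

# Dummy (batch, channel, height, width) feature maps for a quick check.
f_v, f_i = torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32)
f_v_t, f_i_t = torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32)
print(inner_modality_loss(f_v, f_i, f_v_t, f_i_t))
```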

The cross-modality distillation loss $\mathcal{L}_{\text{CM}}$ leverages salient object location priors to guide feature distillation. An attention mechanism, specifically SimAM, extracts salient object feature attention maps, which serve as location priors. The attention map $\widetilde{\mathcal{M}}$ is calculated as:

$\widetilde{\mathcal{M}} = \text{Sigmoid}\left(\frac{(\widetilde{f}-\widetilde{\mu})^2 + 2\widetilde{\sigma}^2 + 2\lambda}{4(\widetilde{\sigma}^2 + \lambda)}\right)$
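A sketch of this map is below, computing $\widetilde{\mu}$ and $\widetilde{\sigma}^2$ per channel over spatial positions as SimAM does; that reduction and the value of $\lambda$ are assumptions.

```python
import torch

def simam_attention(f_t, lam=1e-4):
    """Attention map from teacher features, following the formula above.
    Mean/variance are taken per channel over spatial positions; lam is lambda."""
    mu = f_t.mean(dim=(2, 3), keepdim=True)
    var = f_t.var(dim=(2, 3), keepdim=True, unbiased=False)
    energy = ((f_t - mu) ** 2 + 2 * var + 2 * lam) / (4 * (var + lam))
    return torch.sigmoid(energy)
```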

The cross-modality feature distillation loss is formulated as:

$\mathcal{L}_{\text{CM}} = \text{D}(\widetilde{\mathcal{M}}_V \odot f_I,\ \widetilde{\mathcal{M}}_V \odot \widetilde{f}_V) + \text{D}(\widetilde{\mathcal{M}}_I \odot f_V,\ \widetilde{\mathcal{M}}_I \odot \widetilde{f}_I)$

where $\widetilde{\mathcal{M}}_V$ and $\widetilde{\mathcal{M}}_I$ are the attention maps of the RGB and infrared modalities. The overall loss of M²D is the sum of the inner- and cross-modality losses:

$\mathcal{L}_{M^2D} = \mathcal{L}_{\text{IM}} + \mathcal{L}_{\text{CM}}$
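Reusing the `simam_attention` sketch above, the cross-modality term and the combined M²D loss might look as follows; MSE is again an assumed stand-in for $\text{D}(\cdot,\cdot)$:

```python
import torch.nn.functional as F

def m2d_loss(f_v, f_i, f_v_t, f_i_t, lam=1e-4):
    """L_M2D = L_IM + L_CM, with teacher attention maps gating the
    cross-modality term as in the formulas above."""
    l_im = F.mse_loss(f_v, f_v_t) + F.mse_loss(f_i, f_i_t)
    m_v = simam_attention(f_v_t, lam)  # salient regions from the visible teacher
    m_i = simam_attention(f_i_t, lam)  # salient regions from the infrared teacher
    l_cm = (F.mse_loss(m_v * f_i, m_v * f_v_t) +
            F.mse_loss(m_i * f_v, m_i * f_i_t))
    return l_im + l_cm
```

Note how the cross term pulls each student modality toward the *other* modality's teacher features, but only inside regions that teacher marks as salient.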

Local Illumination-aware Fusion (LIF)

LIF is a weight-based fusion method that dynamically assigns different weights to regions with different illumination, using a predicted brightness map. The brightness map $B$ is predicted by convolutional layers:

$B = \text{ConvBlock}(I_V)$

where $I_V$ is the RGB image. The loss function $\mathcal{L}_{LI}$ is the L2 norm between the predicted brightness map $B$ and the ground truth $\widetilde{B}$ (the L channel in LAB color space):

$\mathcal{L}_{LI} = \|B - \widetilde{B}\|_2$
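A minimal sketch of this brightness branch, assuming a small ConvBlock (its depth and width are not specified here) and using MSE as the L2-style objective; the ground-truth L channel could be obtained with any standard RGB-to-LAB conversion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical ConvBlock: the summary only says convolutional layers predict B.
brightness_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # keep B in [0, 1]
)

rgb = torch.rand(2, 3, 64, 64)    # dummy RGB batch I_V
gt_l = torch.rand(2, 1, 64, 64)   # ground-truth L channel of LAB, normalized to [0, 1]
loss_li = F.mse_loss(brightness_net(rgb), gt_l)  # L2-style brightness supervision
```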

The weight generation mechanism adaptively adjusts the weights of different modality features:

$\begin{cases} W_V = \beta \times \min\left(\frac{B-\alpha}{2\alpha},\ \frac{1}{2}\right) + \frac{1}{2} \\ W_I = 1 - W_V \end{cases}$

where $W_V$ and $W_I$ represent the weights of the RGB and infrared modalities, respectively, $\alpha$ is a threshold, and $\beta$ is the amplitude of $W_V$. The final fused feature $f^i_F$ is represented as:

$f^i_F = \mathcal{F}(f_V, f_I) = W^i_V \odot f^i_V + W^i_I \odot f^i_I$
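The weighting and fusion together reduce to a few lines. In the sketch below, $\alpha$ is an assumed threshold value, $\beta = 0.4$ follows the ablation result reported later, and the brightness map is assumed to be resized to each feature level's resolution:

```python
import torch
import torch.nn.functional as F

def lif_fusion(f_v, f_i, b, alpha=0.5, beta=0.4):
    """Local Illumination-aware Fusion for one feature level.
    b: predicted brightness map, resized here to match the feature maps."""
    b = F.interpolate(b, size=f_v.shape[2:], mode='bilinear', align_corners=False)
    w_v = beta * torch.clamp((b - alpha) / (2 * alpha), max=0.5) + 0.5  # min(., 1/2)
    w_i = 1.0 - w_v
    return w_v * f_v + w_i * f_i  # element-wise weighted sum of the modalities
```

Intuitively, bright regions push $W_V$ above 0.5 (favoring RGB features), while dark regions shift weight toward the infrared features.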

The overall loss function is:

$\mathcal{L} = \mathcal{L}_{\text{det}} + \lambda_{M^2D}\,\mathcal{L}_{M^2D} + \lambda_{LI}\,\mathcal{L}_{LI}$

where $\lambda_{M^2D}$ and $\lambda_{LI}$ are balancing hyperparameters.
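Putting the pieces together, the total objective is a simple weighted sum; the $\lambda$ defaults below are placeholders rather than the paper's settings:

```python
def total_loss(loss_det, loss_m2d, loss_li, lambda_m2d=1.0, lambda_li=1.0):
    """Overall training objective: detection + distillation + brightness terms."""
    return loss_det + lambda_m2d * loss_m2d + lambda_li * loss_li
```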

Experimental Results

Experiments were conducted on the DroneVehicle, FLIR-aligned, and LLVIP datasets. Ablation studies demonstrate the effectiveness of both the M²D and LIF modules, and a sweep over the hyperparameter $\beta$ showed that a value of 0.4 achieves the best results (Figure 4).

Figure 4: A bar chart showing the impact of varying the hyperparameter $\beta$ on the performance of the M²D-LIF framework.

Visualizations of detection results demonstrate that M²D-LIF effectively mitigates the Fusion Degradation phenomenon (Figure 5).

Figure 5: Visualizations of Fusion Degradation, comparing the detection results of various methods with M²D-LIF.

The visualization of the LIF weight map shows that the module effectively perceives illumination and assigns higher weights to regions with better lighting conditions (Figure 6).

Figure 6: Visualization of the weight map $W_V$ generated by the LIF module, showing its adaptation to local illumination conditions.

Comparison with state-of-the-art methods on the DroneVehicle dataset shows that M²D-LIF achieves the highest mAP$_{50}$ and mAP, at 81.4% and 68.1%, respectively. On the FLIR and LLVIP datasets, M²D-LIF achieves 46.1% and 70.8% mAP, respectively, while maintaining a relatively low parameter count.

Conclusion

The paper (2503.11780) makes a compelling case for rethinking MMOD from a mono-modality learning perspective. The proposed M²D-LIF framework effectively addresses the Fusion Degradation phenomenon and achieves state-of-the-art performance on multiple datasets. The M²D component enhances mono-modal feature extraction, while the LIF module provides a lightweight yet effective fusion mechanism. This work opens avenues for future research in multi-modal learning, particularly in addressing modality-specific challenges and improving feature fusion strategies.
