DeCoDet: Depth-Conditioned Detector
- The paper introduces a modular detector that integrates a multi-scale depth-aware detection head and a depth-cue condition kernel, delivering state-of-the-art mAP improvements.
- It employs a novel scale-invariant refurbishment loss to supervise noisy pseudo-depth maps, ensuring consistent depth estimation across scales.
- The architecture is plug-and-play, enhancing both one-stage and two-stage detectors, and is validated on synthetic and real-world hazy drone imagery.
A Depth-Conditioned Detector (DeCoDet) is a neural object detection architecture that integrates depth estimation directly into the detection pipeline for robust performance under challenging visual degradations, specifically haze, as encountered in drone-based aerial imagery. DeCoDet is distinguished by its multi-scale depth-aware design, the use of scale-invariant loss functions for supervising noisy depth estimates, and a Depth-Cue Condition Kernel for dynamically modulating features via localized, depth-dependent convolutions. Designed for application and benchmarking on the HazyDet dataset, DeCoDet has demonstrated state-of-the-art performance on both synthetic and real-world hazy test scenarios and provides a modular, plug-and-play enhancement for conventional one-stage or two-stage object detectors (Feng et al., 2024).
1. Architectural Components and Workflow
DeCoDet follows a modular detector design in which standard backbone networks and Feature Pyramid Networks (FPNs) provide multi-scale visual features, while additional specialized components handle the fusion and exploitation of depth cues:
- Backbone + FPN: Any off-the-shelf detector backbone may be used; the reference implementation utilizes ResNet-50 with FPN to output multi-resolution feature maps.
- Multi-Scale Depth-Aware Detection Head (MDDH): For each scale $s$, stacked Conv+BN+ReLU layers produce intermediate features $F_s$. A $1 \times 1$ convolution at the output of each scale head yields a single-channel pseudo-depth map $\hat{D}_s$.
- Scale-Invariant Refurbishment Loss (SIRLoss): Supervises each pseudo-depth map $\hat{D}_s$ by enforcing consistency with pseudo-depth ground truth while mitigating the negative impact of label noise.
- Depth-Cue Condition Kernel (DCK): Dynamically generates a convolutional kernel for each spatial location using a hypernetwork; these kernels modulate the detection features in a depth-aware, spatially adaptive manner.
- Detection Heads: Final classification, regression, and centerness sub-heads operate on the DCK-modulated features to predict bounding boxes and class confidence.
The architecture is compatible with existing detector frameworks and can be plugged into different detection heads, including FCOS and VFNet, with empirically validated improvements in mean Average Precision (mAP).
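The per-level flow through the MDDH can be sketched in plain NumPy. This is an illustrative shape-level sketch only: the `conv1x1` helper, the random weights, and the use of a single pointwise layer as a stand-in for the stacked Conv+BN+ReLU tower are all assumptions, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """Pointwise (1x1) convolution: x is (C, H, W), w is (C_out, C)."""
    c, h, wid = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wid)

def relu(x):
    return np.maximum(x, 0.0)

# Toy FPN outputs at three scales: (channels, height, width).
fpn_levels = [rng.standard_normal((256, s, s)) for s in (32, 16, 8)]

# Shared head weights across scales (hypothetical sizes).
w_tower = rng.standard_normal((256, 256)) * 0.05  # stand-in for the Conv+BN+ReLU stack
w_depth = rng.standard_normal((1, 256)) * 0.05    # 1x1 conv -> single-channel pseudo-depth

head_feats, depth_maps = [], []
for F in fpn_levels:
    feat = relu(conv1x1(F, w_tower))   # intermediate MDDH features F_s
    depth = conv1x1(feat, w_depth)     # pseudo-depth map D_s, shape (1, H, W)
    head_feats.append(feat)
    depth_maps.append(depth)

for d in depth_maps:
    print(d.shape)  # one single-channel map per scale
```

During training, each of these per-scale depth maps is supervised with SIRLoss, while the intermediate features continue into the DCK and the detection sub-heads.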
2. Depth-Cue Condition Kernel: Formulation and Mechanism
Central to DeCoDet is the DCK, which enables fine-grained, location-specific adaptation of detection features based on estimated depth:
Let $F \in \mathbb{R}^{C \times H \times W}$ denote an MDDH intermediate feature map. For spatial position $(x, y)$:
- Bottleneck projection: $z_{x,y} = \delta(W_{\mathrm{down}}\, f_{x,y})$,
where $f_{x,y} \in \mathbb{R}^{C}$ is the local feature vector, $W_{\mathrm{down}} \in \mathbb{R}^{(C/r) \times C}$ with reduction ratio $r$, and $\delta(\cdot)$ denotes BN+activation.
- Kernel generation (hypernetwork output): $K_{x,y} = W_{\mathrm{up}}\, z_{x,y}$,
which, after reshaping, defines a $k \times k$ grouped convolutional kernel.
- Depth-aware convolution: $\tilde{F}(x, y) = F(x, y) + \sum_{(i, j) \in \Omega_k} K_{x,y}(i, j) \odot F(x + i,\, y + j)$,
where $F$ is the input feature map and the sum is taken over the $k \times k$ patch $\Omega_k$ around $(x, y)$. Grouped, spatially-varying convolution is realized in practice, and the additive $F(x, y)$ term is a residual connection included to stabilize updates under depth uncertainty.
Ablation experiments over the reduction ratio and kernel size identified the optimal DCK hyperparameters for the tested settings.
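The spatially-varying, depth-conditioned convolution above can be sketched as follows. The function name, the random hypernetwork weights, and the simplification of emitting one $k \times k$ kernel shared across channels per location (rather than a general grouped kernel) are assumptions for illustration:

```python
import numpy as np

def dck_modulate(F, w_down, w_up, k=3):
    """Depth-cue conditioned, spatially-varying convolution with residual.

    F      : (C, H, W) input feature map
    w_down : (C//r, C) bottleneck projection (hypernetwork stage 1)
    w_up   : (k*k, C//r) kernel generator (hypernetwork stage 2); here it
             emits one k*k kernel per location, shared across channels
    """
    C, H, W = F.shape
    pad = k // 2
    Fp = np.pad(F, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty_like(F)
    for y in range(H):
        for x in range(W):
            f = F[:, y, x]                    # local feature vector f_{x,y}
            z = np.maximum(w_down @ f, 0.0)   # bottleneck + activation
            K = (w_up @ z).reshape(k, k)      # per-location kernel K_{x,y}
            patch = Fp[:, y:y + k, x:x + k]   # k x k neighborhood of (x, y)
            out[:, y, x] = (patch * K).sum(axis=(1, 2))
    return F + out                            # residual connection

rng = np.random.default_rng(1)
C, r = 8, 4
F = rng.standard_normal((C, 6, 6))
w_down = rng.standard_normal((C // r, C)) * 0.1
w_up = rng.standard_normal((9, C // r)) * 0.1
out = dck_modulate(F, w_down, w_up)
print(out.shape)  # (8, 6, 6)
```

The explicit per-pixel loop makes the mechanism obvious; a practical implementation would vectorize it (e.g., with unfolded patches), since the kernel differs at every spatial location.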
3. Scale-Invariant Refurbishment Loss (SIRLoss)
Supervision of the intermediate depth maps produced by MDDH is achieved with SIRLoss, addressing the dual challenge of variability in scale and label noise:
- Scale-invariant depth error (as in Eigen et al., NIPS 2014): with log-depth residuals $d_i = \log \hat{D}_i - \log D^{*}_i$ over $n$ pixels,
$\mathcal{L}_{\mathrm{SI}}(\hat{D}, D^{*}) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \big( \sum_i d_i \big)^2$.
- Label refurbishment (to mitigate depth label noise): the noisy pseudo-depth label $D$ is blended with the model's own prediction, $\tilde{D} = \alpha D + (1 - \alpha)\, \hat{D}$, $\alpha \in [0, 1]$.
- Final SIRLoss: the scale-invariant error evaluated against the refurbished target, $\mathcal{L}_{\mathrm{SIR}} = \mathcal{L}_{\mathrm{SI}}(\hat{D}, \tilde{D})$.
This loss enforces that the detector's internal depth estimates remain scale-consistent across spatial locations while smoothing out artifacts from unreliable pseudo-depth maps.
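A minimal NumPy sketch of this loss follows, combining the log-space scale-invariant error with a convex-combination refurbished target. The `alpha` and `lam` values, the `eps` stabilizer, and the synthetic test data are placeholders, not the paper's settings:

```python
import numpy as np

def scale_invariant_error(pred_log, target_log, lam=0.5):
    """Scale-invariant error in log-depth space (Eigen et al., 2014)."""
    d = pred_log - target_log
    n = d.size
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2

def sir_loss(pred_depth, noisy_label, alpha=0.9, lam=0.5, eps=1e-6):
    """Refurbish the noisy pseudo-depth label, then score scale-invariantly."""
    refurbished = alpha * noisy_label + (1.0 - alpha) * pred_depth
    return scale_invariant_error(np.log(pred_depth + eps),
                                 np.log(refurbished + eps), lam)

rng = np.random.default_rng(2)
depth_gt = rng.uniform(1.0, 50.0, size=(64, 64))         # clean depth (unknown in practice)
noisy = depth_gt * rng.uniform(0.8, 1.2, size=(64, 64))  # noisy pseudo-depth label
good_pred = depth_gt * 1.02                              # near-perfect prediction
bad_pred = np.full_like(depth_gt, 25.0)                  # constant, uninformative prediction

print(sir_loss(good_pred, noisy) < sir_loss(bad_pred, noisy))  # True
```

Note that the scale-invariant term forgives a global multiplicative depth offset (a constant shift in log space), which is exactly the ambiguity monocular pseudo-depth cannot resolve.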
4. Training Protocols and Dataset Utilization
DeCoDet is primarily validated on the HazyDet benchmark, which includes both large-scale synthetic and curated real-world hazy drone imagery:
- Synthetic HazyDet: 8,000 train / 1,000 val / 2,000 test images (264K / 34K / 65K boxes).
- Real-hazy RDDTS: 600 test images (~19K boxes) used exclusively for zero-shot evaluation.
Key training settings:
- Input: 1,333 × 800 resizing, random horizontal flip (probability 0.5).
- Optimization: SGD for 12 epochs, initial LR = 0.01, step decay at epochs 8 and 11, batch size = 2, weight decay = 1e-4, momentum = 0.938.
- Depth supervision: Pseudo-depth annotations derived from monocular models (Metric3D v2 found optimal) are used as the ground-truth targets for SIRLoss.
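The step-decay schedule stated above is straightforward to encode. This sketch only reproduces the listed settings (base LR 0.01, decay by a factor of 0.1 at epochs 8 and 11 in a 12-epoch run); whether the milestones are 0- or 1-indexed is an assumption here:

```python
def step_lr(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Learning rate for a given (0-indexed) epoch under step decay."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

schedule = [step_lr(e) for e in range(12)]
for e in (0, 8, 11):
    print(e, step_lr(e))
```

In a framework this would typically be delegated to the trainer's multi-step scheduler rather than computed by hand.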
No explicit progressive domain adaptation or curriculum training is proposed in DeCoDet. Synthetic-to-real transfer is achieved by jointly training on synthetic data and reporting zero-shot mAP on RDDTS.
5. Empirical Results and Ablation Analysis
Quantitative evaluation demonstrates the efficacy of DeCoDet in hazy scenarios, with controlled ablation illuminating the contribution of key modules:
| Detector Variant | Test-set mAP | RDDTS mAP |
|---|---|---|
| Baseline FCOS (ResNet-50, synthetic haze) | 45.9 | 22.8 |
| +MDDH | 46.2 | – |
| +MDDH+SIRLoss | 46.4 | – |
| +MDDH+DCK | 46.9 | – |
| Full DeCoDet (MDDH+DCK+SIRLoss) | 47.4 | 24.3 |
Full DeCoDet provides a +1.5 mAP improvement over baseline in both synthetic and real haze, without requiring paired clear/hazy images or joint restoration–detection pipelines.
Qualitative results (Grad-CAM) show that DeCoDet feature responses are robust against haze-induced artifacts, with tighter focus on true object regions across variable fog densities. Performance degradation when training depth with intentionally noisy labels (mAP drops 47.4 → 39.7) further underscores the importance of accurate pseudo-depth supervision.
6. Generalization, Plug-and-Play Nature, and Limitations
DeCoDet demonstrates generality by boosting other one-stage detection architectures (e.g., VFNet) by 0.3–1.5 mAP, confirming its role as a modular enhancement for a range of modern detectors (Feng et al., 2024). The architecture does not require paired data, explicit domain adaptation schedules, or costly restoration subnetworks.
A plausible implication is that future work may benefit from improved depth pseudo-label quality or domain-bridging techniques, as performance is sensitive to the reliability of these precomputed depth cues.
7. Significance and Broader Context
DeCoDet establishes a new approach for robust object detection in adverse visual environments by embedding multi-scale, depth-aware reasoning directly into the detector. It provides evidence that even with noisy supervision, scale-invariant, refurbished depth losses and local depth-driven feature modulations confer tangible improvements in both synthetic and real-world haze, bridging a critical gap in drone autonomy research. The associated HazyDet benchmark and open-source resources facilitate further exploration and benchmarking in this domain (Feng et al., 2024).