DeCoDet: Depth-Conditioned Detector
- The paper introduces a modular detector that integrates a multi-scale depth-aware detection head and a depth-cue condition kernel, delivering state-of-the-art mAP improvements.
- It employs a novel scale-invariant refurbishment loss to supervise noisy pseudo-depth maps, ensuring consistent depth estimation across scales.
- The architecture is plug-and-play, enhancing both one-stage and two-stage detectors, and is validated on synthetic and real-world hazy drone imagery.
A Depth-Conditioned Detector (DeCoDet) is a neural object detection architecture that integrates depth estimation directly into the detection pipeline for robust performance under challenging visual degradations, specifically haze, as encountered in drone-based aerial imagery. DeCoDet is distinguished by its multi-scale depth-aware design, the use of scale-invariant loss functions for supervising noisy depth estimates, and a Depth-Cue Condition Kernel for dynamically modulating features via localized, depth-dependent convolutions. Designed for application and benchmarking on the HazyDet dataset, DeCoDet has demonstrated state-of-the-art performance on both synthetic and real-world hazy test scenarios and provides a modular, plug-and-play enhancement for conventional one-stage or two-stage object detectors (Feng et al., 2024).
1. Architectural Components and Workflow
DeCoDet follows a modular detector design in which standard backbone networks and Feature Pyramid Networks (FPNs) provide multi-scale visual features, while additional specialized components handle the fusion and exploitation of depth cues:
- Backbone + FPN: Any off-the-shelf detector backbone may be used; the reference implementation utilizes ResNet-50 with FPN to output multi-resolution feature maps.
- Multi-Scale Depth-Aware Detection Head (MDDH): For each scale $s$, stacked Conv+BN+ReLU layers produce intermediate features $F_s$. A $1 \times 1$ convolution at the output of each scale head yields a single-channel pseudo-depth map $\hat{D}_s$.
- Scale-Invariant Refurbishment Loss (SIRLoss): Supervises each pseudo-depth map $\hat{D}_s$ by enforcing consistency with pseudo-depth ground truth while mitigating the negative impact of label noise.
- Depth-Cue Condition Kernel (DCK): Dynamically generates a convolutional kernel for each spatial location using a hypernetwork; these kernels modulate the detection features in a depth-aware, spatially adaptive manner.
- Detection Heads: Final classification, regression, and centerness sub-heads operate on the DCK-modulated features to predict bounding boxes and class confidence.
The architecture is compatible with existing detector frameworks and can be plugged into different detection heads, including FCOS and VFNet, with empirically validated improvements in mean Average Precision (mAP).
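The per-level flow through the MDDH can be sketched in plain NumPy. This is an illustrative shape-level sketch only: the `conv1x1` helper, the random weights, and the use of a single pointwise layer as a stand-in for the stacked Conv+BN+ReLU tower are all assumptions, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """Pointwise (1x1) convolution: x is (C, H, W), w is (C_out, C)."""
    c, h, wid = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wid)

def relu(x):
    return np.maximum(x, 0.0)

# Toy FPN outputs at three scales: (channels, height, width).
fpn_levels = [rng.standard_normal((256, s, s)) for s in (32, 16, 8)]

# Shared head weights across scales (hypothetical sizes).
w_tower = rng.standard_normal((256, 256)) * 0.05  # stand-in for the Conv+BN+ReLU stack
w_depth = rng.standard_normal((1, 256)) * 0.05    # 1x1 conv -> single-channel pseudo-depth

head_feats, depth_maps = [], []
for F in fpn_levels:
    feat = relu(conv1x1(F, w_tower))   # intermediate MDDH features F_s
    depth = conv1x1(feat, w_depth)     # pseudo-depth map D_s, shape (1, H, W)
    head_feats.append(feat)
    depth_maps.append(depth)

for d in depth_maps:
    print(d.shape)  # one single-channel map per scale
```

During training, each of these per-scale depth maps is supervised with SIRLoss, while the intermediate features continue into the DCK and the detection sub-heads.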
2. Depth-Cue Condition Kernel: Formulation and Mechanism
Central to DeCoDet is the DCK, which enables fine-grained, location-specific adaptation of detection features based on estimated depth:
Let $F \in \mathbb{R}^{C \times H \times W}$ denote an MDDH intermediate feature map. For spatial position $(x, y)$:
- Bottleneck projection: $z_{x,y} = \delta(W_{\mathrm{down}}\, f_{x,y})$,
where $f_{x,y} \in \mathbb{R}^{C}$ is the local feature vector, $W_{\mathrm{down}} \in \mathbb{R}^{(C/r) \times C}$ with reduction ratio $r$, and $\delta(\cdot)$ denotes BN+activation.
- Kernel generation (hypernetwork output): $K_{x,y} = W_{\mathrm{up}}\, z_{x,y}$,
which, after reshaping, defines a $k \times k$ grouped convolutional kernel.
- Depth-aware convolution: $\tilde{F}(x, y) = F(x, y) + \sum_{(i, j) \in \Omega_k} K_{x,y}(i, j) \odot F(x + i,\, y + j)$,
where $F$ is the input feature map and the sum is taken over the $k \times k$ patch $\Omega_k$ around $(x, y)$. Grouped, spatially-varying convolution is realized in practice, and the additive $F(x, y)$ term is a residual connection included to stabilize updates under depth uncertainty.
Ablation experiments over the reduction ratio and kernel size identified the optimal DCK hyperparameters for the tested settings.
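The spatially-varying, depth-conditioned convolution above can be sketched as follows. The function name, the random hypernetwork weights, and the simplification of emitting one $k \times k$ kernel shared across channels per location (rather than a general grouped kernel) are assumptions for illustration:

```python
import numpy as np

def dck_modulate(F, w_down, w_up, k=3):
    """Depth-cue conditioned, spatially-varying convolution with residual.

    F      : (C, H, W) input feature map
    w_down : (C//r, C) bottleneck projection (hypernetwork stage 1)
    w_up   : (k*k, C//r) kernel generator (hypernetwork stage 2); here it
             emits one k*k kernel per location, shared across channels
    """
    C, H, W = F.shape
    pad = k // 2
    Fp = np.pad(F, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty_like(F)
    for y in range(H):
        for x in range(W):
            f = F[:, y, x]                    # local feature vector f_{x,y}
            z = np.maximum(w_down @ f, 0.0)   # bottleneck + activation
            K = (w_up @ z).reshape(k, k)      # per-location kernel K_{x,y}
            patch = Fp[:, y:y + k, x:x + k]   # k x k neighborhood of (x, y)
            out[:, y, x] = (patch * K).sum(axis=(1, 2))
    return F + out                            # residual connection

rng = np.random.default_rng(1)
C, r = 8, 4
F = rng.standard_normal((C, 6, 6))
w_down = rng.standard_normal((C // r, C)) * 0.1
w_up = rng.standard_normal((9, C // r)) * 0.1
out = dck_modulate(F, w_down, w_up)
print(out.shape)  # (8, 6, 6)
```

The explicit per-pixel loop makes the mechanism obvious; a practical implementation would vectorize it (e.g., with unfolded patches), since the kernel differs at every spatial location.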
3. Scale-Invariant Refurbishment Loss (SIRLoss)
Supervision of the intermediate depth maps produced by MDDH is achieved with SIRLoss, addressing the dual challenge of variability in scale and label noise:
- Scale-invariant depth error (as in Eigen et al., NIPS 2014): with log-depth residuals $d_i = \log \hat{D}_i - \log D^{*}_i$ over $n$ pixels,
$\mathcal{L}_{\mathrm{SI}}(\hat{D}, D^{*}) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \big( \sum_i d_i \big)^2$.
- Label refurbishment (to mitigate depth label noise): the noisy pseudo-depth label $D$ is blended with the model's own prediction, $\tilde{D} = \alpha D + (1 - \alpha)\, \hat{D}$, $\alpha \in [0, 1]$.
- Final SIRLoss: the scale-invariant error evaluated against the refurbished target, $\mathcal{L}_{\mathrm{SIR}} = \mathcal{L}_{\mathrm{SI}}(\hat{D}, \tilde{D})$.
This loss enforces that the detector's internal depth estimates remain scale-consistent across spatial locations while smoothing out artifacts from unreliable pseudo-depth maps.
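A minimal NumPy sketch of this loss follows, combining the log-space scale-invariant error with a convex-combination refurbished target. The `alpha` and `lam` values, the `eps` stabilizer, and the synthetic test data are placeholders, not the paper's settings:

```python
import numpy as np

def scale_invariant_error(pred_log, target_log, lam=0.5):
    """Scale-invariant error in log-depth space (Eigen et al., 2014)."""
    d = pred_log - target_log
    n = d.size
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2

def sir_loss(pred_depth, noisy_label, alpha=0.9, lam=0.5, eps=1e-6):
    """Refurbish the noisy pseudo-depth label, then score scale-invariantly."""
    refurbished = alpha * noisy_label + (1.0 - alpha) * pred_depth
    return scale_invariant_error(np.log(pred_depth + eps),
                                 np.log(refurbished + eps), lam)

rng = np.random.default_rng(2)
depth_gt = rng.uniform(1.0, 50.0, size=(64, 64))         # clean depth (unknown in practice)
noisy = depth_gt * rng.uniform(0.8, 1.2, size=(64, 64))  # noisy pseudo-depth label
good_pred = depth_gt * 1.02                              # near-perfect prediction
bad_pred = np.full_like(depth_gt, 25.0)                  # constant, uninformative prediction

print(sir_loss(good_pred, noisy) < sir_loss(bad_pred, noisy))  # True
```

Note that the scale-invariant term forgives a global multiplicative depth offset (a constant shift in log space), which is exactly the ambiguity monocular pseudo-depth cannot resolve.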
4. Training Protocols and Dataset Utilization
DeCoDet is primarily validated on the HazyDet benchmark, which includes both large-scale synthetic and curated real-world hazy drone imagery:
- Synthetic HazyDet: 8,000 train / 1,000 val / 2,000 test images (264K / 34K / 65K boxes).
- Real-hazy RDDTS: 600 test images (~19K boxes) used exclusively for zero-shot evaluation.
Key training settings:
- Input: 1,333 × 800 resizing, random horizontal flip (probability 0.5).
- Optimization: SGD for 12 epochs, initial LR = 0.01, step decay at epochs 8 and 11, batch size = 2, weight decay = 1e-4, momentum = 0.938.
- Depth supervision: Pseudo-depth annotations derived from monocular models (Metric3D v2 found optimal) are used as the ground-truth targets for SIRLoss.
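The step-decay schedule stated above is straightforward to encode. This sketch only reproduces the listed settings (base LR 0.01, decay by a factor of 0.1 at epochs 8 and 11 in a 12-epoch run); whether the milestones are 0- or 1-indexed is an assumption here:

```python
def step_lr(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Learning rate for a given (0-indexed) epoch under step decay."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

schedule = [step_lr(e) for e in range(12)]
for e in (0, 8, 11):
    print(e, step_lr(e))
```

In a framework this would typically be delegated to the trainer's multi-step scheduler rather than computed by hand.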
No explicit progressive domain adaptation or curriculum training is proposed in DeCoDet. Synthetic-to-real transfer is achieved by jointly training on synthetic data and reporting zero-shot mAP on RDDTS.
5. Empirical Results and Ablation Analysis
Quantitative evaluation demonstrates the efficacy of DeCoDet in hazy scenarios, with controlled ablation illuminating the contribution of key modules:
| Detector Variant | Test-set mAP | RDDTS mAP |
|---|---|---|
| Baseline FCOS (ResNet-50, synthetic haze) | 45.9 | 22.8 |
| +MDDH | 46.2 | – |
| +MDDH+SIRLoss | 46.4 | – |
| +MDDH+DCK | 46.9 | – |
| Full DeCoDet (MDDH+DCK+SIRLoss) | 47.4 | 24.3 |
Full DeCoDet provides a +1.5 mAP improvement over baseline in both synthetic and real haze, without requiring paired clear/hazy images or joint restoration–detection pipelines.
Qualitative results (Grad-CAM) show that DeCoDet feature responses are robust against haze-induced artifacts, with tighter focus on true object regions across variable fog densities. Performance degradation when training depth with intentionally noisy labels (mAP drops 47.4 → 39.7) further underscores the importance of accurate pseudo-depth supervision.
6. Generalization, Plug-and-Play Nature, and Limitations
DeCoDet demonstrates generality by boosting other one-stage detection architectures (e.g., VFNet) by 0.3–1.5 mAP, confirming its role as a modular enhancement for a range of modern detectors (Feng et al., 2024). The architecture does not require paired data, explicit domain adaptation schedules, or costly restoration subnetworks.
A plausible implication is that future work may benefit from improved depth pseudo-label quality or domain-bridging techniques, as performance is sensitive to the reliability of these precomputed depth cues.
7. Significance and Broader Context
DeCoDet establishes a new approach for robust object detection in adverse visual environments by embedding multi-scale, depth-aware reasoning directly into the detector. It provides evidence that even with noisy supervision, scale-invariant, refurbished depth losses and local depth-driven feature modulations confer tangible improvements in both synthetic and real-world haze, bridging a critical gap in drone autonomy research. The associated HazyDet benchmark and open-source resources facilitate further exploration and benchmarking in this domain (Feng et al., 2024).