
Depth Prior Fusion Module (DPFM)

Updated 18 January 2026
  • DPFM is a fusion mechanism that integrates depth priors from diverse sources with RGB cues to improve scene structure and image restoration.
  • It employs techniques such as dual cross-attention, confidence-based weighting, and global/local alignment to merge appearance and geometry effectively.
  • Empirical results show DPFM enhances PSNR in dehazing, reduces depth errors in autonomous driving, and boosts detection in camouflaged scenes.

A Depth Prior Fusion Module (DPFM) is a neural network sub-component for integrating scene depth priors—derived from monocular/single-image predictors, multi-view geometry, or explicit depth sensors—with RGB or other modality features to enhance scene understanding, restore degraded images, or increase geometric fidelity. Although the specifics of DPFM design and application vary by task, the core principle is learnable, per-pixel adaptation: fusing depth-based information with appearance cues in a spatially and contextually aware manner. The following sections summarize the principal DPFM formulations, their architectures, and their impact across several application domains as reflected in recent research.

1. Conceptual Foundations and Motivation

Scene depth encodes intrinsic geometric structure that is poorly inferred from RGB images alone due to ambiguities in illumination, object texture, and occlusion. Single-view predictors offer robust, semantics-driven relative depth but lack metric scale and often exhibit low spatial granularity. Multi-view geometry (stereo, structure-from-motion, depth-from-focus, or LiDAR) can yield metric-accurate depth but fails where image evidence or pose accuracy is poor. DPFMs address these deficits by leveraging the complementary strengths of each depth source: they dynamically select, refine, or blend modalities to optimize structural fidelity, robustness to appearance or sensory degradation, and domain adaptability.

In image enhancement and restoration, depth priors inform the physical interaction of scene structure with factors such as haze or camouflage, enabling models to disambiguate color attenuation from scene content (Zuo et al., 11 Jan 2026, Liua et al., 2024). In depth estimation itself, fusion with DPFMs enables robust generalization to challenging visual conditions and mitigates failure in textureless or dynamic regions (Cheng et al., 2024, Ganj et al., 2024).

2. Core Architectures Across Domains

2.1. Dual Cross-Attention Sliding-Window DPFM for Dehazing

UDPNet’s DPFM computes hierarchical integration of multi-scale image and depth features via dual sliding-window multi-head cross-attention (Zuo et al., 11 Jan 2026). For each encoder stage and windowed spatial region:

  • An RGB feature map $X \in \mathbb{R}^{H \times W \times C}$ and a (projected) depth feature map $F_D \in \mathbb{R}^{H \times W \times C}$ are extracted and zero-padded for precise window tiling.
  • Each non-overlapping query window is paired with overlapping key/value regions.
  • Two cross-attention branches are computed per window: “Depth→RGB” (depth as query, image as key/value) and “RGB→Depth” (the converse).
  • Multi-head cross-attention is formulated as:

$\begin{align*} \mathrm{Attn}_{D\to X} &= \mathrm{Concat}_{i}\left( \mathrm{Softmax}\left( \frac{Q_D^{(i)} K_X^{(i)\mathsf{T}}}{\sqrt{d}} + B^{(i)} \right) V_X^{(i)} \right) \\ \mathrm{Attn}_{X\to D} &= \mathrm{Concat}_{i}\left( \mathrm{Softmax}\left( \frac{Q_X^{(i)} K_D^{(i)\mathsf{T}}}{\sqrt{d}} + B^{(i)} \right) V_D^{(i)} \right) \end{align*}$

  • Per-window output is summed and merged with the original:

$Y_w = \mathrm{Attn}_{D\to X} + \mathrm{Attn}_{X\to D} + X_w$

  • Global merging, two-layer FFN with residual connections, and normalization complete the operation.
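The per-window computation above can be sketched in a few lines. This is a minimal single-head NumPy illustration only: the learned Q/K/V projections, the relative-position bias $B^{(i)}$, multi-head concatenation, and the trailing FFN of the actual module are all omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q, kv, dim):
    # Single-head cross-attention: queries from one modality, keys/values
    # from the other (projections and position bias omitted for brevity).
    scores = q @ kv.T / np.sqrt(dim)     # (Nq, Nkv)
    return softmax(scores) @ kv          # (Nq, dim)

def dpfm_window(x_win, d_win):
    """Fuse one window of RGB features x_win and depth features d_win,
    each of shape (tokens, channels), and merge with a residual."""
    dim = x_win.shape[1]
    attn_d2x = cross_attn(d_win, x_win, dim)   # Depth -> RGB branch
    attn_x2d = cross_attn(x_win, d_win, dim)   # RGB -> Depth branch
    return attn_d2x + attn_x2d + x_win         # Y_w
```

Both branches are summed symmetrically with the residual, so neither modality is privileged before the subsequent FFN stage.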

2.2. Adaptive Confidence-Based Fusion for Depth Estimation

In autonomous driving, DPFM is instantiated as a pixel-wise, confidence-weighted combination of single- and multi-view depth branches (Cheng et al., 2024):

  • Each pixel’s fused depth is computed via

$D_{fused}(x) = C(x) \cdot D_{mv}(x) + (1 - C(x)) \cdot D_{sv}(x)$

where $C(x)$ is a predicted gate reflecting relative confidence in the multi-view versus single-view estimate.

  • The confidence map CC integrates sigmoided outputs from the single-view branch, the multi-view photometric cost volume, and a “warping confidence” from photometric consistency between warped source and reference images.
  • All confidences are processed through learned 3×3 conv layers and a final sigmoid.
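The blending rule itself is straightforward; a minimal NumPy sketch is below. Here `conf_logits` stands in for the output of the learned 3×3 convolution stack, which this sketch does not model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_depth(d_mv, d_sv, conf_logits):
    """Pixel-wise blend of multi-view and single-view depth maps.
    conf_logits: raw confidence logits (stand-in for the learned convs)."""
    c = sigmoid(conf_logits)          # C(x) in [0, 1]
    return c * d_mv + (1.0 - c) * d_sv
```

With strongly positive logits the fused map reduces to the multi-view estimate; with strongly negative logits it falls back to the single-view branch, which is the behavior that protects against pose noise.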

2.3. Cross-Attention with Depth Reliability Weighting for RGB-D

In camouflaged object detection, the analogous fusion module (called DCF) establishes per-channel adaptive weighting for depth features via channel and spatial attention and an explicit confidence map $Q_d$ (Liua et al., 2024):

  • RGB and depth features are fused cross-modally via spatial and channel gates.
  • A Deep Attention Weighting block produces $Q^i_d$, a per-channel scalar, by computing softmax similarity between projected RGB and depth features, applying global average pooling, and passing the result through an MLP.
  • The final fused feature at stage ii is

$x^i_f = Q^i_d \odot D' + R'$

where $D'$ and $R'$ are the cross-attended depth and color features.
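The gating step can be sketched as follows. The `depth_reliability` helper is a crude stand-in for the Deep Attention Weighting block (it replaces the softmax-similarity-plus-MLP pipeline with a pooled cosine similarity), so only the shape of the computation matches the paper, not the learned behavior.

```python
import numpy as np

def depth_reliability(rgb_feat, depth_feat):
    """Stand-in for Deep Attention Weighting: per-channel cosine similarity
    between globally pooled RGB and depth features, mapped to [0, 1]."""
    r = rgb_feat.reshape(rgb_feat.shape[0], -1)
    d = depth_feat.reshape(depth_feat.shape[0], -1)
    cos = (r * d).sum(1) / (
        np.linalg.norm(r, axis=1) * np.linalg.norm(d, axis=1) + 1e-8)
    return 0.5 * (cos + 1.0)          # map [-1, 1] -> [0, 1]

def fuse_rgbd(depth_feat, rgb_feat, q_d):
    """Stage-wise fusion x_f = Q_d (per channel) * D' + R'.
    depth_feat, rgb_feat: (C, H, W) cross-attended features; q_d: (C,)."""
    return q_d[:, None, None] * depth_feat + rgb_feat
```

When a channel's reliability gate goes to zero, its depth contribution is suppressed entirely and the fused feature falls back to the RGB branch.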

2.4. Closed-Form and Local-Refinement Fusion

HybridDepth’s fusion computes a global, closed-form linear alignment (scale and shift) between a single-image relative depth and a metric depth-from-focus prediction, followed by a learned local scale refinement network (Ganj et al., 2024):

  • The global alignment is solved in closed form via least squares:

$\min_{S,\hat{S}} \; \lVert M_n - (S D_n + \hat{S}) \rVert_2^2$

  • The refined output is produced by an encoder-decoder network that consumes the globally aligned depth together with a per-pixel local scale map $L_s = M_n / (S \cdot D_n + \hat{S})$.
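The closed-form step is an ordinary linear least-squares fit of scale and shift, which a short NumPy sketch makes concrete (the learned local refinement network is not modeled here):

```python
import numpy as np

def global_align(relative_depth, metric_depth):
    """Closed-form least-squares scale S and shift S_hat such that
    S * relative_depth + S_hat best matches metric_depth."""
    d = relative_depth.ravel()
    m = metric_depth.ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)      # design matrix [D_n, 1]
    (s, s_hat), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s, s_hat
```

Given the fitted pair, the local scale map fed to the refinement network is simply `metric_depth / (s * relative_depth + s_hat)` per pixel.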

3. Mathematical Formulation and Pseudocode

The core DPFM implementations are characterized by spatially structured fusion, dual-branch attention, and/or confidence-weighted blending. The following table summarizes the practical forms:

| Domain | Fusion Equation / Operation | Adaptive Mechanism |
|---|---|---|
| Dehazing (Zuo et al., 11 Jan 2026) | Dual windowed cross-attention, $Y_w = \mathrm{Attn}_{D\to X} + \mathrm{Attn}_{X\to D} + X_w$ | Dual cross-attention, local window |
| Depth estimation (Cheng et al., 2024) | $D_{fused} = C \cdot D_{mv} + (1-C) \cdot D_{sv}$ | Confidence map, conv + sigmoid |
| COD (Liua et al., 2024) | $x^i_f = Q^i_d \odot D' + R'$ | Channel & spatial attention, DAW |
| HybridDepth (Ganj et al., 2024) | $D_{metric} = S \cdot D_n + \hat{S}$ (global); $D^* = D_{metric} + \Delta S$ (local) | Global alignment + local network |

Detailed pseudocode for the sliding-window dual cross-attention DPFM and for the confidence-based fusion used in autonomous driving appears in (Zuo et al., 11 Jan 2026) and (Cheng et al., 2024, AFNet), respectively.

4. Empirical Ablation and Quantitative Impact

Ablation studies demonstrate that DPFMs provide statistically significant gains across application domains by improving both global accuracy and robustness to adverse conditions.

  • In UDPNet (dehazing), DPFM yields up to 1.10 dB PSNR gain on Haze4K over RGB-only baseline, and 0.24 dB beyond channel attention alone (Zuo et al., 11 Jan 2026).
  • In AFNet (depth estimation), DPFM improves SqRel error on DDAD from 1.78 (multi-view only) to 1.55 (fused), and AbsRel from 0.104 to 0.088. Under pose noise, DPFM maintains strong performance where MVS-only baselines catastrophically fail (Cheng et al., 2024).
  • In camouflaged object detection, DCF (functionally a DPFM) improves S-measure by 2–3% absolute, and removing attention or depth weighting collapses gains (Liua et al., 2024).
  • In HybridDepth, the combination of global alignment and local refinement (functionally equivalent to a scale-shift DPFM) reduces RMSE and AbsRel error by 20–40% compared to naive global or single-branch alternatives (Ganj et al., 2024).

5. Integration Strategies and Design Choices

DPFMs are typically inserted at encoder stages, after initial feature extraction has occurred but before deep semantic aggregation. This leverages depth priors for spatially-resolved decisions and preserves high-frequency information for later recovery. Key design axes include:

  • Window size and overlap (UDPNet): $M = 8$, $r = 0.5$; overlapping attention windows balance local and global context.
  • Attention heads: Performance saturates at two heads; relative-position bias per head is maintained for spatial modeling.
  • Confidence estimation: In AFNet, integrating multiple confidence sources (single-view, multi-view, and warping) ensures robust gating.
  • Depth pre-processing: In all settings, depth priors are downsampled or projected to match the spatial and channel size of image features, typically via learned convolutional layers.
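The depth pre-processing step in the last bullet can be illustrated with a toy example. This is a hand-rolled sketch, not any paper's implementation: average pooling stands in for learned downsampling, and a per-channel weight vector `w` stands in for a learned 1×1 convolution expanding one depth channel to C feature channels.

```python
import numpy as np

def project_depth(depth, out_hw, w):
    """Downsample a (H, W) depth prior to out_hw = (h, w) by average
    pooling, then project 1 channel to C channels with weight w of
    shape (C,), mimicking a 1x1 conv. Returns (C, h, w)."""
    H, W = depth.shape
    h_out, w_out = out_hw
    fh, fw = H // h_out, W // w_out
    pooled = depth[:h_out * fh, :w_out * fw] \
        .reshape(h_out, fh, w_out, fw).mean(axis=(1, 3))
    return w[:, None, None] * pooled[None]
```

After this step the depth prior has the same spatial and channel dimensions as the image features, so the fusion modules above can consume it directly.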

6. Implementation Efficiency and Training Regimes

DPFMs are constrained by the need to maintain spatial fidelity while minimizing compute overhead:

  • Overlapping window attention, implemented with batch-wise PyTorch unfold/fold, keeps scaling tractable at $\mathcal{O}(HW \cdot M \cdot M_{ov})$ on large images (Zuo et al., 11 Jan 2026).
  • Frozen pretrained monocular depth methods (e.g., DepthAnything V2 (Zuo et al., 11 Jan 2026), MiDaS v3.1 (Liua et al., 2024)) act as fixed prior extractors, reducing end-to-end training cost.
  • Optimizers and hyperparameters are tailored by task, with AdamW and one-cycle or stage-wise LR schedules being predominant.
  • Skip connections and exclusion of DPFM from the decoder mitigate excessive low-frequency blending and loss of localization precision.
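The overlapping-window bookkeeping behind the first bullet can be shown in plain NumPy (a loop-based sketch; real implementations vectorize this with unfold/fold). For each non-overlapping $M \times M$ query window, the enlarged key/value window of side $M_{ov} = M + 2\,\lfloor rM \rfloor$ is read from a zero-padded feature map.

```python
import numpy as np

def key_windows(feat, M, r):
    """For each non-overlapping M x M query window of feat (H, W), return
    the enlarged overlapping key/value window of side M + 2*int(r*M),
    reading from a zero-padded copy. Assumes H, W divisible by M."""
    H, W = feat.shape
    o = int(r * M)                      # overlap on each side
    padded = np.pad(feat, o)            # zero padding at the borders
    wins = []
    for i in range(0, H, M):
        for j in range(0, W, M):
            wins.append(padded[i:i + M + 2 * o, j:j + M + 2 * o])
    return np.stack(wins)               # (H//M * W//M, M_ov, M_ov)
```

Because each key window is only a constant factor larger than its query window, attention cost stays linear in the number of pixels rather than quadratic.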

7. Illustrative Results and Qualitative Effects

Qualitatively, DPFM-driven architectures consistently recover scene geometry and semantic consistency in regions disrupted by adverse visual conditions (haze, camouflage, dynamic objects). In dehazing, DPFMs reconstruct fine structure lost with RGB-only backbones, reflected in both visual fidelity and PSNR/SSIM metrics. In depth estimation for autonomous driving, fused outputs display continuity across dynamic or textureless regions where single-branch models produce holes, discontinuities, or inaccurate object contours. Camouflaged object detection networks with DCF accurately localize and outline camouflaged targets by exploiting complementary depth cues suppressed by the adaptive per-channel gate (Liua et al., 2024).


Collectively, DPFMs comprise a class of mechanisms founded on learnable spatial fusion of depth and appearance priors, providing empirically validated increases in robustness and accuracy across image restoration, scene understanding, and geometric reconstruction tasks (Cheng et al., 2024, Liua et al., 2024, Ganj et al., 2024, Zuo et al., 11 Jan 2026).
