DepthMaster: Deterministic LDM for Depth Estimation

Updated 31 January 2026
  • The paper introduces DepthMaster, a deterministic single-step latent diffusion model that achieves zero-shot monocular depth estimation with enhanced semantic and frequency features.
  • It integrates a Feature Alignment module using DINOv2 and a Fourier Enhancement module to improve edge fidelity and preserve high-frequency details in depth maps.
  • Empirical evaluations on datasets like KITTI and NYUv2 demonstrate superior generalization and precision through a two-stage training framework.

DepthMaster is a single-step deterministic adaptation of latent diffusion models explicitly designed for zero-shot monocular depth estimation. Operating within the diffusion-denoising paradigm, DepthMaster strategically incorporates semantic feature alignment and Fourier-domain detail enhancement within a two-stage training framework, yielding state-of-the-art generalization and detail preservation across multiple real-world datasets (Song et al., 5 Jan 2025).

1. Architecture and Single-Step Deterministic Pipeline

DepthMaster builds on the backbone of Stable Diffusion v2, which is a Latent Diffusion Model (LDM) pre-trained on the LAION-5B dataset. The LDM components include:

  • An encoder–decoder pair (\mathcal{E}, \mathcal{D}) (a variational autoencoder): images I \in \mathbb{R}^{H \times W \times 3} are mapped to latents z \in \mathbb{R}^{h \times w \times c} and back.
  • A U-Net denoiser \epsilon_\theta(z, t), trained to denoise noisy latent codes.

Unlike conventional multi-step diffusion pipelines, DepthMaster predicts depth through a single deterministic U-Net pass at the fixed timestep t = 1:

  1. Latent encoding: z_{\text{RGB}} = \mathcal{E}(I).
  2. Depth latent prediction: z_{\text{pred}} = \epsilon_\theta(z_{\text{RGB}}, 1).
  3. Decoding to a depth map: \hat{D} = \mathcal{D}(z_{\text{pred}}).
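The three-step pass can be sketched as follows. The encoder, U-Net, and decoder below are random stand-ins for the Stable Diffusion v2 components (which are far too large to reproduce here); the sketch only illustrates the data flow and tensor shapes, not the actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the VAE encoder/decoder (E, D) and the fine-tuned
# U-Net eps_theta. Shapes follow the SD v2 convention of an 8x spatial
# downsampling into a 4-channel latent.
H, W, c_lat = 64, 64, 4
h, w = H // 8, W // 8

def encode(image):                     # E: (H, W, 3) -> (h, w, c)
    return rng.standard_normal((h, w, c_lat))

def unet(z, t):                        # eps_theta: one pass at fixed t
    return z + 0.1 * rng.standard_normal(z.shape)

def decode(z):                         # D: (h, w, c) -> (H, W, 1)
    return rng.standard_normal((H, W, 1))

image = rng.standard_normal((H, W, 3))
z_rgb = encode(image)                  # 1. latent encoding
z_pred = unet(z_rgb, t=1)              # 2. deterministic U-Net pass at t = 1
depth = decode(z_pred)                 # 3. decode the depth map
```

Because the timestep is fixed and no noise is sampled, the whole pipeline is a single deterministic forward pass.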

The prediction is made in square-root disparity space, 1/\sqrt{D} (normalized to [-1, 1]), which emphasizes nearby depth values and ensures a more uniform latent distribution (Song et al., 5 Jan 2025).
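A minimal sketch of this preprocessing, assuming a per-array affine normalization of 1/\sqrt{D} to [-1, 1] (the exact normalization constants are an assumption, not taken from the paper):

```python
import numpy as np

def depth_to_target(depth):
    """Map metric depth to normalized square-root disparity in [-1, 1]."""
    d = 1.0 / np.sqrt(depth)                  # square-root disparity
    lo, hi = d.min(), d.max()
    return 2.0 * (d - lo) / (hi - lo) - 1.0   # affine normalization

depth = np.linspace(1.0, 80.0, 100)           # depths in meters
target = depth_to_target(depth)
# Nearby depths occupy the upper end of the range: depth 1 m maps to +1,
# while far depths compress toward -1.
```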

2. Diffusion Process and Mathematical Formulation

DepthMaster inherits the latent DDPM notation but restricts inference and supervision to t = 1:

  • The forward (noising) process for a clean latent z_0 is q(z_t \mid z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I). Cumulatively, q(z_t \mid z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t}\, z_0, (1-\bar{\alpha}_t) I), where \alpha_t and \bar{\alpha}_t are the standard DDPM schedule parameters.
  • The reverse (denoising) model is p_\theta(z_{t-1} \mid z_t) = \mathcal{N}(z_{t-1}; \mu_\theta(z_t, t), \Sigma_\theta(t)).
  • The standard DDPM loss is L_{\text{DDPM}} = \mathbb{E}_{z_0, \epsilon, t}[\|\epsilon - \epsilon_\theta(z_t, t)\|^2], with z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon.
  • DepthMaster's deterministic reformulation at t = 1: z_{\text{pred}} := \epsilon_\theta(z_{\text{RGB}}, 1), supervised by L_{\text{latent}} = \|z_{\text{GT}} - z_{\text{pred}}\|^2 with z_{\text{GT}} = \mathcal{E}(D_{\text{preprocessed}}).

This design enables direct, discriminative supervision in the latent space for depth estimation.
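The contrast between the stochastic DDPM objective and the deterministic one can be written out numerically. Below, the network call is replaced by an identity stand-in and the beta schedule is an illustrative linear one; both are assumptions for the sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard DDPM schedule quantities (illustrative linear beta schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Stochastic forward noising q(z_t | z_0) used by conventional LDMs:
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z0.shape)
t = 500
z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# DepthMaster's deterministic reformulation: no noise is injected. The
# network consumes the clean RGB latent at t = 1 and is supervised against
# the ground-truth depth latent with a plain L2 loss.
z_rgb = rng.standard_normal(z0.shape)   # stand-in for E(I)
z_gt = rng.standard_normal(z0.shape)    # stand-in for E(D_preprocessed)
z_pred = z_rgb                          # stand-in for eps_theta(z_rgb, 1)
loss_latent = np.mean((z_gt - z_pred) ** 2)
```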

3. Feature Alignment Module

The Feature Alignment module counteracts the tendency of generative features to overfit to texture detail. It injects high-level semantic information into the U-Net's deep features using frozen, pre-trained visual encoders; among the encoders tested, DINOv2 yields the largest gains.

Inputs:

  • RGB image I.
  • Semantic features F_{\text{ext}} = f(I) \in \mathbb{R}^{N \times D}, with f the DINOv2 encoder and N the patch count.
  • U-Net intermediate feature F_{\text{unet}} \in \mathbb{R}^{h \times w \times C} (from the middle block).

Procedure:

  1. Reshape F_{\text{unet}} to (N \times C).
  2. Project via an MLP h_\phi: \bar{F}_{\text{unet}} = h_\phi(F_{\text{unet}}) \in \mathbb{R}^{N \times D}.
  3. Feature-wise normalization: produce distributions \tilde{F}_{\text{ext}}, \tilde{F}_{\text{unet}} using either \ell_2 normalization followed by Softmax, or a direct temperature-scaled Softmax(F/\tau).

Alignment loss is computed as:

L_{\text{fa}} = \mathrm{KL}(\tilde{F}_{\text{ext}} \,\|\, \tilde{F}_{\text{unet}})

Minimizing L_{\text{fa}} ties the U-Net latent representations to the external semantic manifold, with DINOv2 providing the strongest improvements (KITTI AbsRel ↓ from 0.087 to 0.083, δ1 ↑ from 93.1% to 93.7%) (Song et al., 5 Jan 2025).
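The alignment loss can be sketched with the direct-softmax variant. The shapes, the temperature tau, and the linear map standing in for the MLP h_\phi are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1, tau=1.0):
    """Temperature-scaled softmax along the feature axis."""
    e = np.exp(x / tau - np.max(x / tau, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy shapes: N patches, D-dim external features, C-dim U-Net features.
N, D, C = 16, 32, 24
F_ext = rng.standard_normal((N, D))                # frozen DINOv2 features
F_unet = rng.standard_normal((N, C))               # mid-block U-Net features
W_phi = rng.standard_normal((C, D)) / np.sqrt(C)   # stand-in for MLP h_phi
F_bar = F_unet @ W_phi                             # project to (N, D)

P = softmax(F_ext, tau=0.5)    # target distribution (frozen encoder)
Q = softmax(F_bar, tau=0.5)    # distribution from the U-Net branch
kl = np.sum(P * (np.log(P) - np.log(Q))) / N       # L_fa = KL(P || Q)
```

Only the U-Net branch receives gradients; the external encoder stays frozen, so the KL term pulls the projected U-Net features toward the semantic distribution.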

4. Fourier Enhancement Module

This module recovers high-frequency local details lost in one-step prediction by mimicking the refinement of multi-step denoising in the frequency domain. At the middle U-Net block, with feature F_{\text{mid}} \in \mathbb{R}^{C \times h \times w}:

  • Spatial branch: F_s = \mathrm{Conv}_s(F_{\text{mid}}).
  • Frequency branch: F_{f,\text{raw}} = \mathrm{FFT}_{2D}(F_{\text{mid}}); modulation M = \mathrm{Conv}_f(F_{f,\text{raw}}) followed by an activation \sigma(M) (e.g., SiLU); back-transform F_f = \mathrm{iFFT}_{2D}(\sigma(M)).
  • Fusion: concatenate [F_s \,\|\, F_f] and apply \mathrm{Conv}_{\text{fuse}}.

Learned modulation in the frequency domain allows adaptive balancing of structure and detail. Ablations indicate clear gains in edge fidelity (edge F1 on Hypersim: 0.306 for the baseline vs. 0.314 with the full module) (Song et al., 5 Jan 2025).
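One plausible reading of the frequency branch is a learned multiplicative gate on the spectrum before the inverse FFT. The sketch below makes that assumption explicit and replaces the convolutions with fixed stand-ins, so it illustrates the shape of the computation rather than the exact wiring:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy feature map (C, h, w).
C, h, w = 4, 8, 8
F_mid = rng.standard_normal((C, h, w))

# Spatial branch: stand-in for Conv_s.
F_s = 0.9 * F_mid

# Frequency branch: FFT -> modulation (assumed here as a gate on the
# spectrum magnitude, standing in for Conv_f + activation) -> inverse FFT.
F_freq = np.fft.fft2(F_mid, axes=(-2, -1))
gate = sigmoid(np.abs(F_freq))
F_f = np.fft.ifft2(F_freq * gate, axes=(-2, -1)).real

# Fusion: concatenate branches along channels, then a stand-in 1x1 conv.
fused = np.concatenate([F_s, F_f], axis=0)         # (2C, h, w)
W_fuse = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)
out = np.einsum('ck,khw->chw', W_fuse, fused)      # back to (C, h, w)
```

In the learned version, the gate would amplify or suppress individual frequency bands, which is what lets the module trade off global structure against fine detail.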

5. Two-Stage Training Framework

DepthMaster optimizes disentangled learning objectives via a curriculum of two training stages:

Stage 1: Structure Pre-Training

  • Freeze the encoder–decoder (\mathcal{E}, \mathcal{D}); train only the U-Net and the Feature Alignment module.
  • Loss: L_{\text{stage1}} = L_{\text{latent}} + \lambda_{\text{fa}} L_{\text{fa}}, with \lambda_{\text{fa}} = 1.

Stage 2: Detail Refinement

  • Initialize from Stage 1; activate Fourier Enhancement module.
  • Pixel-level supervision: L_{\text{pixel}} = \mathbb{E}_{x,y}[(D_{\text{pred}}(x,y) - D_{\text{GT}}(x,y))^2].
  • Gradient-map supervision: compute G_{\text{pred}}, G_{\text{GT}} \in \mathbb{R}^{H \times W \times 4} (horizontal, vertical, and the two diagonal directions).
  • Weighted Huber gradient loss with \delta = 0.05:

L_h(x, y, k) = \delta \cdot |\Delta G| for |\Delta G| \leq \delta, and \frac{1}{2}(\Delta G)^2 + \frac{1}{2}\delta^2 otherwise, with \Delta G = G_{\text{GT}} - G_{\text{pred}} (the two branches meet at |\Delta G| = \delta, so the loss is continuous).

  • Total loss: L_{\text{stage2}} = L_{\text{pixel}} + \lambda_h L_h, with \lambda_h = 0.001.
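The Stage-2 losses can be sketched as below. The four-direction gradient stacking and the zero padding at the borders are illustrative assumptions; the piecewise penalty follows the weighted Huber form stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_maps(depth):
    """Stack four directional finite differences (horizontal, vertical,
    and the two diagonals); borders are zero-padded to keep the shape."""
    g = np.zeros(depth.shape + (4,))
    g[:, :-1, 0] = depth[:, 1:] - depth[:, :-1]        # horizontal
    g[:-1, :, 1] = depth[1:, :] - depth[:-1, :]        # vertical
    g[:-1, :-1, 2] = depth[1:, 1:] - depth[:-1, :-1]   # main diagonal
    g[:-1, 1:, 3] = depth[1:, :-1] - depth[:-1, 1:]    # anti-diagonal
    return g

def weighted_huber(dg, delta=0.05):
    """delta * |x| inside the threshold, quadratic plus a constant outside;
    the two branches agree at |x| = delta."""
    a = np.abs(dg)
    return np.where(a <= delta, delta * a, 0.5 * dg ** 2 + 0.5 * delta ** 2)

d_gt = rng.standard_normal((16, 16))
d_pred = d_gt + 0.01 * rng.standard_normal((16, 16))

L_pixel = np.mean((d_pred - d_gt) ** 2)
L_h = np.mean(weighted_huber(grad_maps(d_gt) - grad_maps(d_pred)))
L_stage2 = L_pixel + 0.001 * L_h
```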

Splitting structure and detail objectives mitigates conflicting gradients versus unified training and improves detail transfer and generalization (Song et al., 5 Jan 2025).

6. Experimental Validation and Benchmark Performance

DepthMaster demonstrates state-of-the-art zero-shot performance on five canonical datasets, evaluated using affine-invariant depth error (AbsRel ↓) and thresholded accuracy (δ1 ↑):

Dataset   AbsRel ↓   δ1 ↑    Rank
KITTI     0.082      93.7%   1
NYUv2     0.050      97.2%   1.2
ETH3D     0.053      97.4%   1.2
ScanNet   0.055      96.7%   1.2
DIODE     0.215      77.6%   1.2

The average rank of 1.2 surpasses prior diffusion-based baselines (Marigold, GeoWizard, DepthFM, GenPercept, Lotus) and rivals large-scale supervised models (Song et al., 5 Jan 2025).

Qualitative comparisons show improved preservation of global scene structure and sharper boundaries, with high-frequency details such as fine rails, leaves, and furniture maintained.

7. Module Analysis, Ablation Studies, and Implementation

Extensive ablation studies illuminate module efficacy:

  • Learning paradigm: iterative multi-step denoising improves accuracy only slightly but roughly doubles inference time (~0.8 s vs. 0.42 s per image on GPU); the single-step formulation achieves comparable AbsRel (0.103 vs. 0.100) at 0.42 s.
  • Depth preprocessing: disparity (1/D) improves AbsRel (KITTI: 0.103 to 0.089); square-root disparity 1/\sqrt{D} improves it further to 0.087, enhancing latent-space uniformity.
  • Feature Alignment: DINOv2 encoder selection maximizes gain; alignment in deeper blocks is superior.
  • Fourier Enhancement and curriculum: edge F1 scores improve incrementally, with the full two-stage approach yielding the highest fidelity (Hypersim F1: 0.337; KITTI AbsRel 0.082; δ1 93.7%).

Implementation specifics:

  • Training data: Hypersim (54K images) and Virtual KITTI (20K images), mixed 9:1.
  • Stage 1: 20K iterations, Adam with learning rate 3\times10^{-5}.
  • Stage 2: 10K iterations, learning rate 3\times10^{-6}.
  • Batch size 32 via gradient accumulation on NVIDIA H800 GPUs; roughly 30 h per stage.

DepthMaster achieves high visual quality and generalization without sacrificing inference efficiency, with its modular architecture directly addressing the limitations of previous diffusion-based monocular depth estimation approaches (Song et al., 5 Jan 2025).
