DepthMaster: Deterministic LDM for Depth Estimation

Updated 31 January 2026

The paper introduces DepthMaster, a deterministic single-step latent diffusion model that achieves zero-shot monocular depth estimation with enhanced semantic and frequency features.
It integrates a Feature Alignment module using DINOv2 and a Fourier Enhancement module to improve edge fidelity and preserve high-frequency details in depth maps.
Empirical evaluations on datasets like KITTI and NYUv2 demonstrate superior generalization and precision through a two-stage training framework.

DepthMaster is a single-step deterministic adaptation of latent diffusion models explicitly designed for zero-shot monocular depth estimation. Operating within the diffusion-denoising paradigm, DepthMaster strategically incorporates semantic feature alignment and Fourier-domain detail enhancement within a two-stage training framework, yielding state-of-the-art generalization and detail preservation across multiple real-world datasets (Song et al., 5 Jan 2025).

1. Architecture and Single-Step Deterministic Pipeline

DepthMaster builds on the backbone of Stable Diffusion v2, which is a Latent Diffusion Model (LDM) pre-trained on the LAION-5B dataset. The LDM components include:

An encoder–decoder pair $(\mathcal{E},\mathcal{D})$ (a variational autoencoder): images $I \in \mathbb{R}^{H \times W \times 3}$ are mapped to latents $z \in \mathbb{R}^{h \times w \times c}$ and back.
A U-Net denoiser $\epsilon_\theta(z, t)$ : trained to denoise noisy latent codes.

Unlike conventional multi-step diffusion pipelines, DepthMaster achieves depth prediction through a single deterministic U-Net pass at fixed timestep $t=1$ :

Latent encoding: $z_{\text{RGB}} = \mathcal{E}(I)$ .
Depth latent prediction: $z_{\text{pred}} = \epsilon_\theta(z_{\text{RGB}}, 1)$ .
Decoding for depth map: $\hat{D} = \mathcal{D}(z_{\text{pred}})$ .

The prediction is made in square-root disparity space, $1/\sqrt{D}$ (normalized to $[-1,1]$ ), which emphasizes nearby depth values and ensures a more uniform latent distribution (Song et al., 5 Jan 2025).

2. Diffusion Process and Mathematical Formulation

DepthMaster inherits the latent DDPM notation but restricts inference and supervision to $t=1$ :

The forward (noising) process for a clean latent $z_0$ is: $q(z_t | z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t} \, z_{t-1}, \beta_t I)$ . Cumulatively: $q(z_t | z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t} z_0, (1-\bar{\alpha}_t)I)$ , where $\alpha_t$ and $\bar{\alpha}_t$ are standard DDPM schedule parameters.
The reverse (denoising) model is: $p_\theta(z_{t-1}|z_t) = \mathcal{N}(z_{t-1}; \mu_\theta(z_t, t), \Sigma_\theta(t))$ .
The standard DDPM loss: $L_{\text{DDPM}} = \mathbb{E}_{z_0,\epsilon,t} [\|\epsilon - \epsilon_\theta(z_t, t)\|^2],$ with $z_t = \sqrt{\bar{\alpha}_t} z_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ .
DepthMaster’s deterministic reformulation at $t=1$ : $z_{\text{pred}} := \epsilon_\theta(z_{\text{RGB}}, 1)$ , supervised by $L_{\text{latent}} = \|z_{\text{GT}} - z_{\text{pred}}\|^2$ with $z_{\text{GT}} = \mathcal{E}(D_{\text{preprocessed}})$ .

This design enables direct, discriminative supervision in the latent space for depth estimation.

3. Feature Alignment Module

The Feature Alignment module addresses overfitting to texture details induced by generative features. It injects high-level semantic information into U-Net’s deep features, using frozen, pre-trained visual encoders—specifically, DINOv2 yields optimal gains.

Inputs:

RGB image $I$ .
Semantic features $F_{\text{ext}} = f(I) \in \mathbb{R}^{N \times D}$ , with $f$ = DINOv2 encoder and $N$ = patch count.
U-Net intermediate feature $F_{\text{unet}} \in \mathbb{R}^{h \times w \times C}$ (from middle block).

Procedure:

Reshape $F_{\text{unet}}$ to $(N \times C)$ .
Project via MLP $h_\phi$ : $\bar{F}_{\text{unet}} = h_\phi(F_{\text{unet}}) \in \mathbb{R}^{N \times D}$ .
Feature-wise normalization: produce distributions $\tilde{F}_{\text{ext}}, \tilde{F}_{\text{unet}}$ using either $\ell_2$ normalization plus Softmax or direct Softmax $(F/\tau)$ .

Alignment loss is computed as:

$L_{\text{fa}} = \mathrm{KL}(\tilde{F}_{\text{ext}} \| \tilde{F}_{\text{unet}})$

Minimizing $L_{\text{fa}}$ tightly ties U-Net latent representations to external semantic manifolds, with DINOv2 providing strongest improvements (KITTI AbsRel $\downarrow$ from $0.087 \to 0.083$ , $\delta_1$ $\uparrow$ $93.1\%\to93.7\%$ ) (Song et al., 5 Jan 2025).

4. Fourier Enhancement Module

This module recovers high-frequency local details lost in one-step prediction by simulating multi-step denoising in the frequency domain. At the middle U-Net block ( $F_{\text{mid}} \in \mathbb{R}^{C \times h \times w}$ ):

Spatial branch: $F_s = \mathrm{Conv}_s(F_{\text{mid}})$ .
Frequency branch: $F_{f,\text{raw}} = \mathrm{FFT}_{2D}(F_{\text{mid}})$ ; modulation $M = \mathrm{Conv}_f(F_{f,\text{raw}})\to \sigma(M)$ (e.g., SiLU activation); back-transform $F_f = \mathrm{iFFT}_{2D}(\sigma(M))$ .
Fusion: concatenate $[F_s \,\|\, F_f]$ and apply $\mathrm{Conv}_{\text{fuse}}$ .

Learned modulation in the frequency domain allows adaptive balancing of structure and detail. Ablations indicate significant increases in edge fidelity (F1 on Hypersim: baseline $0.306$, full module $0.314$+) (Song et al., 5 Jan 2025).

5. Two-Stage Training Framework

DepthMaster optimizes disentangled learning objectives via a curriculum of two training stages:

Stage 1: Structure Pre-Training

Freeze encoder–decoder $(\mathcal{E},\mathcal{D})$ , train only U-Net and Feature Alignment.
Loss: $L_{\text{stage1}} = L_{\text{latent}} + \lambda_{\text{fa}} L_{\text{fa}}$ ; $\lambda_{\text{fa}}=1$ .

Stage 2: Detail Refinement

Initialize from Stage 1; activate Fourier Enhancement module.
Pixel-level supervision: $L_{\text{pixel}} = \mathbb{E}_{x,y} [(D_{\text{pred}}(x,y) - D_{\text{GT}}(x,y))^2]$ .
Gradient map supervision: compute $G_{\text{pred}}, G_{\text{GT}} \in \mathbb{R}^{H \times W \times 4}$ (horizontal, vertical, diagonal).
Weighted Huber gradient loss for $\delta=0.05$ :

$L_h(x,y,k) =$

$\,\,\, \delta \cdot |\Delta G|$ for $|\Delta G| \leq \delta$ , else $\,\,\, \frac{1}{2}(\Delta G)^2 + \frac{1}{2}\delta^2$ , with $\Delta G = G_{\text{GT}} - G_{\text{pred}}$ .

Total loss: $L_{\text{stage2}} = L_{\text{pixel}} + \lambda_h L_h$ ; $\lambda_h=0.001$ .

Splitting structure and detail objectives mitigates conflicting gradients versus unified training and improves detail transfer and generalization (Song et al., 5 Jan 2025).

6. Experimental Validation and Benchmark Performance

DepthMaster demonstrates state-of-the-art zero-shot performance on five canonical datasets, evaluated using affine-invariant depth error (AbsRel $\downarrow$ ) and accuracy (thresholded $\delta_1 \uparrow$ ):

Dataset	AbsRel ↓	$\delta_1$ ↑	Rank
KITTI	0.082	93.7%	1
NYUv2	0.050	97.2%	1.2
ETH3D	0.053	97.4%	1.2
ScanNet	0.055	96.7%	1.2
DIODE	0.215	77.6%	1.2

Average rank: $1.2$—surpassing prior diffusion-based baselines (Marigold, GeoWizard, DepthFM, GenPercept, Lotus), and rivaling large-scale supervised models (Song et al., 5 Jan 2025).

Qualitative comparisons show improved preservation of global scene structure and sharper boundaries (visual F1 increase), with high-frequency details maintained (fine rails, leaves, furniture).

7. Module Analysis, Ablation Studies, and Implementation

Extensive ablation studies illuminate module efficacy:

Learning paradigm: iterative multi-step denoising increases accuracy slightly, but doubles inference time ( $\sim0.8$ s vs. $0.42$s GPU); single-step achieves comparable AbsRel ( $0.103\to0.100$ ) at $0.42$s.
Depth preprocessing: disparity ($1/D$) improves AbsRel (KITTI: $0.103\to0.089$ ); $\sqrt{1/D}$ further to $0.087$, enhancing latent uniformity.
Feature Alignment: DINOv2 encoder selection maximizes gain; alignment in deeper blocks is superior.
Fourier Enhancement and curriculum: edge F1 scores improve incrementally, with the full two-stage approach yielding the highest fidelity (Hypersim F1: $0.337$; KITTI AbsRel $0.082$; $\delta_1$ $93.7\%$ ).

Implementation specifics:

Training: Hypersim ($54$K images) and Virtual KITTI ($20$K), $9:1$ mix.
Stage 1: $20$K iterations, Adam lr= $3\times10^{-5}$ .
Stage 2: $10$K iterations, lr= $3\times10^{-6}$ .
Batch size: $32$ via gradient accumulation, on NVIDIA H800 GPUs, $30$h per stage.

DepthMaster achieves high visual quality and generalization without sacrificing inference efficiency, with its modular architecture directly addressing the limitations of previous diffusion-based monocular depth estimation approaches (Song et al., 5 Jan 2025).

Markdown Report Issue Upgrade to Chat

References (1)

DepthMaster: Taming Diffusion Models for Monocular Depth Estimation (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to DepthMaster Model.