DepthMaster: Deterministic LDM for Depth Estimation
- The paper introduces DepthMaster, a deterministic single-step latent diffusion model that achieves zero-shot monocular depth estimation with enhanced semantic and frequency features.
- It integrates a Feature Alignment module using DINOv2 and a Fourier Enhancement module to improve edge fidelity and preserve high-frequency details in depth maps.
- Empirical evaluations on datasets like KITTI and NYUv2 demonstrate superior generalization and precision through a two-stage training framework.
DepthMaster is a single-step deterministic adaptation of latent diffusion models explicitly designed for zero-shot monocular depth estimation. Operating within the diffusion-denoising paradigm, DepthMaster strategically incorporates semantic feature alignment and Fourier-domain detail enhancement within a two-stage training framework, yielding state-of-the-art generalization and detail preservation across multiple real-world datasets (Song et al., 5 Jan 2025).
1. Architecture and Single-Step Deterministic Pipeline
DepthMaster builds on the backbone of Stable Diffusion v2, which is a Latent Diffusion Model (LDM) pre-trained on the LAION-5B dataset. The LDM components include:
- An encoder–decoder pair (a variational autoencoder): images are mapped to latents and back.
- A U-Net denoiser $f_\theta$, trained to denoise noisy latent codes.
Unlike conventional multi-step diffusion pipelines, DepthMaster achieves depth prediction through a single deterministic U-Net pass at the fixed timestep $t = T$:
- Latent encoding: $z^x = \mathcal{E}(x)$.
- Depth latent prediction: $\hat{z}^d = f_\theta(z^x, T)$.
- Decoding for depth map: $\hat{d} = \mathcal{D}(\hat{z}^d)$.
The prediction is made in square-root disparity space, $d = \sqrt{1/D}$ (normalized to $[0,1]$), which emphasizes nearby depth values and ensures a more uniform latent distribution (Song et al., 5 Jan 2025).
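The square-root disparity preprocessing can be sketched in a few lines of NumPy (the min-max normalization scheme and `eps` guard are assumptions; the source only specifies the $\sqrt{1/D}$ mapping and the $[0,1]$ range):

```python
import numpy as np

def to_sqrt_disparity(depth, eps=1e-6):
    """Map metric depth D to square-root disparity sqrt(1/D),
    then min-max normalize to [0, 1] (normalization scheme assumed)."""
    disp = np.sqrt(1.0 / np.maximum(depth, eps))
    d_min, d_max = disp.min(), disp.max()
    return (disp - d_min) / (d_max - d_min + eps)

depth = np.array([[1.0, 2.0], [4.0, 10.0]])  # toy depth map in meters
d = to_sqrt_disparity(depth)
# the nearest point (1 m) maps near 1.0, the farthest (10 m) to 0.0
```

Compared with plain disparity, the square root compresses the spread of far-field values, so nearby depths occupy a larger share of the target interval, which is the stated motivation.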
2. Diffusion Process and Mathematical Formulation
DepthMaster inherits the latent DDPM notation but restricts inference and supervision to the final timestep $t = T$:
- The forward (noising) process for a clean latent $z_0$ is $q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\, \sqrt{1-\beta_t}\, z_{t-1},\, \beta_t I\right)$. Cumulatively: $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ are standard DDPM schedule parameters.
- The reverse (denoising) model is $p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\!\left(z_{t-1};\, \mu_\theta(z_t, t),\, \sigma_t^2 I\right)$.
- The standard DDPM loss: $\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t}\!\left[\lVert \epsilon - \epsilon_\theta(z_t, t) \rVert^2\right]$ with $\epsilon \sim \mathcal{N}(0, I)$.
- DepthMaster’s deterministic reformulation at $t = T$: $\hat{z}^d = f_\theta(z^x, T)$, supervised by $\mathcal{L}_{latent} = \lVert \hat{z}^d - z^d \rVert^2$ with $z^d = \mathcal{E}(d)$.
This design enables direct, discriminative supervision in the latent space for depth estimation.
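A minimal NumPy sketch contrasts the stochastic forward process with the deterministic $t = T$ supervision; the identity `f_theta` and the linear beta schedule are hypothetical stand-ins for the U-Net and the actual schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DDPM schedule: beta_t, alpha_t = 1 - beta_t, cumulative alpha_bar_t
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

z0 = rng.standard_normal(16)   # clean depth latent z^d (toy, 16-dim)
eps = rng.standard_normal(16)  # Gaussian noise

# Forward noising: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
t = T - 1
z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
# At t = T, alpha_bar is ~0, so z_T is almost pure noise and carries no
# signal; the image latent z^x must supply the conditioning instead.

def f_theta(z_x):              # hypothetical stand-in for the U-Net
    return z_x                 # identity placeholder

z_x = z0 + 0.1 * rng.standard_normal(16)  # toy image latent correlated with z^d
loss = np.mean((f_theta(z_x) - z0) ** 2)  # deterministic latent-space loss
```

Because $\bar{\alpha}_T \approx 0$, nothing of $z_0$ survives in $z_T$; a single pass conditioned on $z^x$ can therefore regress $z^d$ directly, which is what the deterministic reformulation exploits.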
3. Feature Alignment Module
The Feature Alignment module addresses overfitting to texture details induced by generative features. It injects high-level semantic information into U-Net’s deep features, using frozen, pre-trained visual encoders—specifically, DINOv2 yields optimal gains.
Inputs:
- RGB image $x$.
- Semantic features $f_{sem} = E(x) \in \mathbb{R}^{N \times C}$, with $E$ = DINOv2 encoder and $N$ = patch count.
- U-Net intermediate feature $f_{mid} \in \mathbb{R}^{h \times w \times c}$ (from the middle block).
Procedure:
- Reshape $f_{mid}$ to $\mathbb{R}^{hw \times c}$.
- Project via an MLP $\phi$: $\tilde{f} = \phi(f_{mid})$.
- Feature-wise normalization: produce distributions $p$ (from $f_{sem}$) and $q$ (from $\tilde{f}$) using either normalization plus Softmax or a direct Softmax.
Alignment loss is computed as the divergence between the two distributions, $\mathcal{L}_{fa} = \mathrm{KL}\!\left(p \,\Vert\, q\right)$.
Minimizing $\mathcal{L}_{fa}$ tightly ties U-Net latent representations to external semantic manifolds, with DINOv2 providing the strongest improvement in KITTI AbsRel (Song et al., 5 Jan 2025).
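Assuming the alignment objective is a KL divergence between softmax-normalized feature distributions (the exact normalization variant is simplified, and a single linear map `W` stands in for the MLP $\phi$), the computation looks like:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def alignment_loss(f_unet, f_sem, W):
    """KL divergence between softmax-normalized DINOv2 and U-Net feature
    distributions. f_unet: (N, c) reshaped U-Net middle features;
    f_sem: (N, C) DINOv2 patch features; W: (c, C) linear projection
    standing in for the MLP (shapes are illustrative assumptions)."""
    p = softmax(f_sem)       # target semantic distribution per patch
    q = softmax(f_unet @ W)  # projected U-Net distribution per patch
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))

rng = np.random.default_rng(0)
N, c, C = 64, 32, 48         # patches, U-Net channels, DINOv2 channels (toy)
f_sem = rng.standard_normal((N, C))
f_unet = rng.standard_normal((N, c))
W = rng.standard_normal((c, C)) * 0.1
loss = alignment_loss(f_unet, f_sem, W)          # positive for mismatched features
zero = alignment_loss(f_sem, f_sem, np.eye(C))   # identical distributions -> 0
```

Gradient descent on this loss pulls the projected U-Net feature distribution toward the frozen semantic encoder's distribution, which is the mechanism the module relies on.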
4. Fourier Enhancement Module
This module recovers high-frequency local details lost in one-step prediction by simulating multi-step denoising in the frequency domain. At the middle U-Net block ($f \in \mathbb{R}^{h \times w \times c}$):
- Spatial branch: $f_s = \mathrm{Conv}(f)$.
- Frequency branch: $F = \mathcal{F}(f)$ (2D FFT); modulation $F' = F \odot \sigma(\mathrm{Conv}(F))$ (e.g., with a SiLU activation $\sigma$); back-transform $f_f = \mathcal{F}^{-1}(F')$.
- Fusion: concatenate and apply a convolution, $f_{out} = \mathrm{Conv}([f_s, f_f])$.
Learned modulation in the frequency domain allows adaptive balancing of structure and detail. Ablations indicate significant increases in edge fidelity (F1 on Hypersim: baseline $0.306$, full module $0.314$+) (Song et al., 5 Jan 2025).
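A minimal NumPy sketch of the FFT-modulate-inverse pattern, with directly learned per-frequency gains and a sum fusion standing in for the convolutions (hypothetical simplifications of the actual module):

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(x):
    return x / (1.0 + np.exp(-x))

def fourier_enhance(f, w_spatial, w_freq):
    """Sketch of the frequency branch: FFT over spatial dims, learned
    modulation of the spectrum, inverse FFT, fused with a spatial branch.
    w_spatial / w_freq are toy learnable gains replacing the Convs."""
    f_s = w_spatial * f                       # spatial branch (scaled features)
    F = np.fft.fft2(f, axes=(0, 1))           # 2D spectrum, per channel
    F_mod = F * silu(w_freq)                  # learned gain on frequency components
    f_f = np.fft.ifft2(F_mod, axes=(0, 1)).real
    return f_s + f_f                          # sum stands in for concat + Conv

h, w, c = 8, 8, 4
f = rng.standard_normal((h, w, c))
out = fourier_enhance(f, w_spatial=1.0, w_freq=np.zeros((h, w, 1)))   # gain 0: spatial only
out2 = fourier_enhance(f, w_spatial=0.5, w_freq=np.ones((h, w, 1)))   # nonzero gain
```

Because the gain acts per frequency, the module can amplify high-frequency bins (fine detail) while leaving low-frequency structure intact, which is the claimed balancing mechanism.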
5. Two-Stage Training Framework
DepthMaster optimizes disentangled learning objectives via a curriculum of two training stages:
Stage 1: Structure Pre-Training
- Freeze the encoder–decoder $(\mathcal{E}, \mathcal{D})$; train only the U-Net and the Feature Alignment module.
- Loss: $\mathcal{L}_{s1} = \mathcal{L}_{latent} + \lambda_{fa}\, \mathcal{L}_{fa}$, where $\lambda_{fa}$ weights the alignment term.
Stage 2: Detail Refinement
- Initialize from Stage 1; activate Fourier Enhancement module.
- Pixel-level supervision: an image-space loss $\mathcal{L}_{pix}$ between the decoded prediction $\hat{d} = \mathcal{D}(\hat{z}^d)$ and the ground truth $d$.
- Gradient map supervision: compute gradient maps of $\hat{d}$ and $d$ along horizontal, vertical, and diagonal directions.
- Weighted Huber gradient loss on the gradient error $e$:
$\ell(e) = \tfrac{1}{2} e^2$ for $|e| \le \delta$, else $\ell(e) = \delta\!\left(|e| - \tfrac{1}{2}\delta\right)$, with threshold $\delta$.
- Total loss: $\mathcal{L}_{s2} = \mathcal{L}_{pix} + \lambda_g\, \mathcal{L}_{grad}$, with $\lambda_g$ weighting the gradient term.
Splitting structure and detail objectives mitigates conflicting gradients versus unified training and improves detail transfer and generalization (Song et al., 5 Jan 2025).
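The Stage 2 gradient supervision can be sketched as follows; the wrap-around finite differences via `np.roll`, the threshold `delta`, and the uniform averaging are simplifying assumptions, and the paper's exact per-direction weighting is not reproduced:

```python
import numpy as np

def huber(e, delta=0.1):
    """Elementwise Huber penalty: quadratic for |e| <= delta, linear beyond."""
    a = np.abs(e)
    return np.where(a <= delta, 0.5 * e**2, delta * (a - 0.5 * delta))

def gradient_loss(pred, gt, delta=0.1):
    """Huber loss over horizontal, vertical, and diagonal gradient maps.
    Finite differences use np.roll, so edges wrap (toy simplification)."""
    total = 0.0
    for shift in [(0, 1), (1, 0), (1, 1)]:   # horizontal, vertical, diagonal
        gp = pred - np.roll(pred, shift, axis=(0, 1))
        gg = gt - np.roll(gt, shift, axis=(0, 1))
        total += np.mean(huber(gp - gg, delta))
    return total / 3.0

rng = np.random.default_rng(0)
gt = rng.standard_normal((16, 16))
loss_zero = gradient_loss(gt, gt)   # identical maps -> zero gradient loss
loss_pos = gradient_loss(gt + 0.05 * rng.standard_normal((16, 16)), gt)
```

Supervising gradient differences rather than raw values penalizes blurred edges directly while the Huber form keeps large outlier gradients from dominating the objective.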
6. Experimental Validation and Benchmark Performance
DepthMaster demonstrates state-of-the-art zero-shot performance on five canonical datasets, evaluated using affine-invariant depth error (AbsRel $\downarrow$) and threshold accuracy ($\delta_1 \uparrow$, the fraction of pixels with $\max(\hat{d}/d,\, d/\hat{d}) < 1.25$):
| Dataset | AbsRel ↓ | $\delta_1$ ↑ | Rank |
|---|---|---|---|
| KITTI | 0.082 | 93.7% | 1 |
| NYUv2 | 0.050 | 97.2% | 1.2 |
| ETH3D | 0.053 | 97.4% | 1.2 |
| ScanNet | 0.055 | 96.7% | 1.2 |
| DIODE | 0.215 | 77.6% | 1.2 |
Average rank: $1.2$—surpassing prior diffusion-based baselines (Marigold, GeoWizard, DepthFM, GenPercept, Lotus), and rivaling large-scale supervised models (Song et al., 5 Jan 2025).
Qualitative comparisons show improved preservation of global scene structure and sharper boundaries (visual F1 increase), with high-frequency details maintained (fine rails, leaves, furniture).
7. Module Analysis, Ablation Studies, and Implementation
Extensive ablation studies illuminate module efficacy:
- Learning paradigm: iterative multi-step denoising improves accuracy only marginally while roughly doubling inference time; the single-step pass achieves comparable AbsRel at $0.42$ s per image on GPU.
- Depth preprocessing: predicting disparity ($1/D$) improves KITTI AbsRel over raw depth; square-root disparity improves it further to $0.087$, enhancing latent uniformity.
- Feature Alignment: DINOv2 encoder selection maximizes gain; alignment in deeper blocks is superior.
- Fourier Enhancement and curriculum: edge F1 scores improve incrementally, with the full two-stage approach yielding the highest fidelity (Hypersim F1: $0.337$; KITTI AbsRel: $0.082$).
Implementation specifics:
- Training: Hypersim ($54$K images) and Virtual KITTI ($20$K), $9:1$ mix.
- Stage 1: $20$K iterations with the Adam optimizer.
- Stage 2: $10$K iterations.
- Batch size: $32$ via gradient accumulation, on NVIDIA H800 GPUs, $30$h per stage.
DepthMaster achieves high visual quality and generalization without sacrificing inference efficiency, with its modular architecture directly addressing the limitations of previous diffusion-based monocular depth estimation approaches (Song et al., 5 Jan 2025).