Diffusion for Metric Depth (DMD)
- DMD is a framework that uses denoising diffusion probabilistic models to generate metrically accurate depth maps from visual inputs.
- It integrates physical cues and geometric constraints to resolve scale ambiguity and boost robustness in both indoor and outdoor scenes.
- The approach combines synthetic pre-training with plug-and-play test-time adaptation, achieving state-of-the-art performance in depth estimation.
Diffusion for Metric Depth (DMD) refers to a family of methods that leverage denoising diffusion probabilistic models (DDPMs) and their latent variants to produce metrically accurate depth maps from visual input. The DMD paradigm addresses both the scale ambiguity endemic to monocular vision and key issues of generalization, robustness, and data availability that have challenged earlier approaches in depth estimation. By integrating physical cues, geometric constraints, or explicit parameterizations, DMD methods achieve zero-shot metric depth estimation—often with high fidelity and robustness across a wide range of domains including indoor/outdoor scenes, adverse conditions, and complex geometric structures.
1. Core Principles and Formulations
DMD recasts single-image or two-view depth estimation as an inverse problem in the context of generative diffusion models. The baseline formulation employs a Markov chain of latent variables or pixel-level representations, $x_0, x_1, \dots, x_T$, governed by a forward (noising) process $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I)$, where $\beta_t$ is a variance schedule, or, equivalently, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$ with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$, and $x_0$ encodes a normalized or log-scaled metric depth map (Saxena et al., 2023, Saxena et al., 2023, Shah et al., 2024, Guizilini et al., 2024).
The reverse process is parameterized by a neural network (typically a U-Net or transformer backbone), which predicts the additive noise $\epsilon_\theta(x_t, t)$ or score in $x_t$—often conditioned on the RGB image, intrinsic parameters (e.g., FOV), and sometimes additional geometric or physical cues. Sampling trajectories may be orchestrated via DDPM, DDIM, or SDE-based solvers, with both L2 and L1 objective variants used for noise prediction and reconstruction (Saxena et al., 2023, Saxena et al., 2023, Guizilini et al., 2024, Shah et al., 2024).
A critical design element for DMD is domain-appropriate normalization of depth, typically via log-scaling, $\tilde d = 2\,\frac{\log d - \log d_{\min}}{\log d_{\max} - \log d_{\min}} - 1$, with $d_{\min}$ and $d_{\max}$ chosen to span the relevant metric range (Saxena et al., 2023, Shah et al., 2024).
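The log-depth normalization and closed-form forward noising described above can be sketched in a few lines. This is a minimal illustration, not any paper's reference implementation; the bounds `d_min=0.5` and `d_max=80.0` are illustrative assumptions:

```python
import numpy as np

def normalize_log_depth(depth, d_min=0.5, d_max=80.0):
    """Map metric depth (meters) to [-1, 1] via log-scaling.

    d_min/d_max are assumed bounds spanning the relevant metric range.
    """
    d = np.clip(depth, d_min, d_max)
    t = (np.log(d) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return 2.0 * t - 1.0

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form using alpha-bar."""
    alphas_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 60.0, size=(4, 4))      # toy metric depth map
x0 = normalize_log_depth(depth)                  # normalized target in [-1, 1]
betas = np.linspace(1e-4, 0.02, 1000)            # linear variance schedule
x_t, eps = forward_noise(x0, 500, betas, rng)    # noised sample at step t=500
```

In training, a denoiser would be supervised to recover `eps` (or `x0`) from `x_t`, conditioned on the RGB image and intrinsics.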
2. Geometric Constraints and Guidance Mechanisms
Several DMD variants enhance metric scale recovery by integrating geometric constraints or direct physical cues during inference:
- Stereo-guided metric scaling: GeoDiff introduces photometric reprojection losses between two (stereo) views, optimizing latent scale and shift parameters in conjunction with the frozen diffusion prior. The reprojection loss combines SSIM and L1 image differences, and the metric depth map is parameterized as $D = s\,\hat d + t$, where $\hat d$ is the diffusion depth prediction and $s$, $t$ are the optimized scale and shift. The pipeline iteratively combines standard denoising with gradients from the geometric loss (Pham et al., 21 Oct 2025).
- Defocus guidance: Diffusion-based monocular models are converted to metric predictors by incorporating defocus blur cues. Given all-in-focus and defocused image pairs, inference-time optimization adjusts latent depths and affine parameters to minimize the error between observed defocus images and those rendered via differentiable, depth-dependent point spread functions (Talegaonkar et al., 23 May 2025).
- Sparse depth steering: SteeredMarigold aligns latent predictions to sparse metric samples by steering the diffusion process toward known points and then fitting a metric affine transformation to the output (Gregorek et al., 2024).
- Field-of-View conditioning: DMD and MetricGold resolve scale ambiguity by conditioning the denoiser on camera FOV, synthetic intrinsics augmentation during training, and log-depth representation, supporting robust generalization across varied camera parameters (Saxena et al., 2023, Shah et al., 2024).
- Pixel-level geometric coding: GRIN leverages 3D intrinsic/positional encodings at the pixel level, allowing the diffusion process to internalize scale from viewing geometry (Guizilini et al., 2024).
3. Training Methodologies and Datasets
DMD methods typically combine large-scale synthetic pre-training and real-world fine-tuning, with explicit emphasis on zero-shot generalization:
- Synthetic/photorealistic data: MetricGold demonstrates strong generalization using only synthetic RGB–depth pairs (HyperSIM, VirtualKITTI2, TartanAir) paired with log-scale normalization (Shah et al., 2024).
- Unsupervised and supervised hybridization: Many DMD methods are bootstrapped with Palette-style unsupervised pre-training on general vision tasks (colorization, inpainting) to encode strong image priors, which are then transferred to metric depth via fine-tuning on NYU, KITTI, Waymo, DIML, and more (Saxena et al., 2023, Saxena et al., 2023, Guizilini et al., 2024).
- Plug-and-play adaptation: Some models (e.g., GeoDiff, SteeredMarigold) require no retraining. Metric scaling and geometric guidance are applied entirely at test time using pretrained diffusion priors (Pham et al., 21 Oct 2025, Gregorek et al., 2024).
- Contrastive robustification: D4RD enhances the stability and environmental robustness of the baseline by combining contrastive learning at noise, feature, and image levels, as well as stabilization via masking, sigmoid transforms, and feature concatenation (Wang et al., 2024).
4. Representative Models and Experimental Outcomes
A selection of DMD models and their salient properties is given below:
| Model | Key Design | Metric Depth Mechanism | Zero-Shot/Domain Transfer Capability |
|---|---|---|---|
| GeoDiff (Pham et al., 21 Oct 2025) | Latent diffusion+stereo | Test-time geometric reprojection, scale/shift opt. | Indoor, outdoor, specular glass, arbitrary poses |
| MetricGold (Shah et al., 2024) | Repurposed SDv2, log-depth | Latent diffusion in log-normalized space | Synthetic→real, robust scale, generalization |
| DMD (Saxena et al., 2023, Saxena et al., 2023) | Efficient U-Net+FOV cond. | Log-depth, FOV parameter embedding | Indoor/outdoor, fast sampling, SOTA error |
| SteeredMarigold (Gregorek et al., 2024) | Plug-and-play steering | Sparse depth conditioning, least-squares scale | Zero-shot completion from partial data |
| GRIN (Guizilini et al., 2024) | Pixel-level DDPM+geo PE | 3D positional encodings, log-depth tokens | Zero-shot, sparse/uncorrelated training |
| FiffDepth (Bai et al., 2024) | FFN-transformed diffusion | One-pass mapping from SD U-Net, DINOv2 pseudo GT | Quick inference, strong boundary fidelity |
| D4RD (Wang et al., 2024) | Diffusion+contrastive | Trinity noise/feat/image contrast, robust loss | Corruption robustness, weather generalization |
Salient empirical metrics highlight the superiority of DMD-based methods over classical discriminative or relative-depth-only models. For instance, DMD achieves a reduction in zero-shot relative error (REL) of 25–33% over ZoeDepth on both indoor and outdoor settings (Saxena et al., 2023), and GRIN attains the best reported AbsRel and RMSE on KITTI, NYU, and six other benchmarks with zero dataset-specific fine-tuning (Guizilini et al., 2024). GeoDiff demonstrates best-in-class AbsRel/RMSE on KITTI/Middlebury and outperforms stereo methods on Booster in challenging specular/transparent settings (Pham et al., 21 Oct 2025).
5. Generalization, Uncertainty, and Limitations
DMD models possess several features promoting robustness and uncertainty estimation:
- Epistemic uncertainty quantification: Multiple diffusion samples allow Monte Carlo estimation of pixel-wise means/variances, directly exposing regions of high depth ambiguity (e.g., occlusions, specularities) (Saxena et al., 2023, Guizilini et al., 2024, Jun et al., 5 Jun 2025).
- Zero-shot generalization: Models trained solely on synthetic data often transfer to novel real domains (NYUv2, KITTI, DIODE, TartanReal) without ground-truth depth (Shah et al., 2024, Guizilini et al., 2024).
- Plug-and-play composition: Several pipelines allow rapid adaptation to new cues (e.g., stereo, defocus) with minimal computational or annotation overhead (Pham et al., 21 Oct 2025, Talegaonkar et al., 23 May 2025).
- Limitations: DMD methods are computationally intensive at inference due to iterative denoising; scaling to real-time remains a challenge, though hybrid feed-forward designs (FiffDepth) or latent consistency distillation are being explored (Bai et al., 2024, Shah et al., 2024). Some variants remain sensitive to the intrinsic quality and domain fidelity of the diffusion prior, and photometric guidance strategies may require future work to handle illumination or domain shifts (Pham et al., 21 Oct 2025, Shah et al., 2024).
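The Monte Carlo uncertainty estimation described in the first bullet above can be sketched as follows. The `sample_fn` here is a toy stand-in for a full reverse-diffusion sampler, with per-pixel noise standing in for genuine depth ambiguity; this is an assumed illustration of the statistic, not a reproduction of any cited pipeline:

```python
import numpy as np

def mc_depth_statistics(sample_fn, n_samples=8):
    """Draw several diffusion samples and compute pixel-wise mean/variance.

    sample_fn: callable returning one depth map per call (a stand-in for
    running the reverse diffusion process from independent noise seeds).
    """
    samples = np.stack([sample_fn() for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)

# Toy sampler: a fixed scene depth plus noise whose magnitude grows with a
# synthetic per-pixel "ambiguity" field (e.g., occlusions, specularities).
rng = np.random.default_rng(2)
base = np.full((4, 4), 5.0)
ambiguity = np.linspace(0.0, 1.0, 16).reshape(4, 4)
mean, var = mc_depth_statistics(
    lambda: base + ambiguity * rng.standard_normal((4, 4)),
    n_samples=200,
)
```

High-variance pixels flag exactly the regions where a single sample should not be trusted, which is the practical payoff of the generative formulation.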
6. Broader Applications and Integration into Vision Systems
DMD frameworks extend naturally to a variety of downstream and related tasks:
- Self-supervised and stereo-based learning: Integration of diffusion-generated multi-baseline stereo views (DMS) enables robust self-supervised training for both stereo and monocular networks, even in the presence of occluded or ill-posed regions (Liu et al., 18 Aug 2025).
- 4D dynamic scene generation: DiST-4D employs disentangled spatiotemporal diffusion over metric RGBD latents to synthesize temporally and spatially consistent 4D driving scenes, leveraging metric depth to bridge cross-camera and temporal consistency (Guo et al., 19 Mar 2025).
- Uncertainty-aware enhancement: Sensor depth completion and denoising can be performed via stochastic diffusion stages for uncertainty detection and deterministic refinement for local accuracy, setting new baselines for inpainting and artifact suppression (Jun et al., 5 Jun 2025).
- Boundary and detail preservation: Distilled or hybrid DMD models (SharpDepth, FiffDepth) achieve a unique trade-off, preserving fine spatial detail and boundary accuracy while maintaining metric fidelity, which is crucial for robotics and AR applications (Pham et al., 2024, Bai et al., 2024).
7. Summary and Ongoing Directions
Diffusion for Metric Depth establishes a rigorously principled and empirically validated foundation for metrically accurate depth estimation. Its design space encompasses conditional geometric guidance, log-scaled normalization, flexible architectural paradigms (latent/pixel, transformer/U-Net), and robust training with limited ground-truth. Continuous developments target the outstanding challenges of efficiency, domain adaptation, real-time inference, and multimodal scalability.
Key frontiers involve consistency or GAN-based distillation for fast inference (Shah et al., 2024), more expressive geometric priors and positional encoding schemes (Guizilini et al., 2024), and tailored photometric or physical cues to resolve remaining ambiguities in unconstrained settings (Pham et al., 21 Oct 2025, Talegaonkar et al., 23 May 2025). These advances are catalyzing progress not only in depth estimation, but also in 4D perception, inpainting, uncertainty-aware enhancement, and unified geometric-vision pipelines.