
MonoDiffusion: Diffusion-based Depth Estimation

Updated 31 January 2026
  • MonoDiffusion is a self-supervised framework that reformulates monocular depth estimation as a conditioned diffusion denoising process, eliminating dependence on true depth labels.
  • It incorporates pseudo ground truth from a teacher network through a forward diffusion process to guide training in the absence of ground-truth depth.
  • A masked visual condition mechanism enhances robustness by randomly masking image features during training, improving depth boundary reconstruction and overall performance.

MonoDiffusion is a self-supervised monocular depth estimation framework that formulates depth regression as a conditional diffusion-based denoising process. Departing from direct depth map prediction, MonoDiffusion iteratively denoises random depth noise guided by image-derived features, compensating for the absence of true depth ground-truth through a pseudo ground-truth diffusion process and a masked visual condition mechanism. This paradigm enables state-of-the-art depth estimation performance on established benchmarks without requiring depth labels at any training stage (Shao et al., 2023).

1. Depth Estimation as a Conditional Diffusion Process

MonoDiffusion recasts monocular depth estimation as a sequence of denoising operations within the generative framework of diffusion models. The standard forward (noising) process is formalized as

q(D_\tau \mid D_{\tau-1}) = \mathcal{N}\left(D_\tau;\ \sqrt{1-\beta_\tau}\, D_{\tau-1},\ \beta_\tau I\right)

where \{\beta_\tau\}_{\tau=1}^T is a fixed noise schedule and D_T is initialized as random noise. The marginal at step \tau is

q(D_\tau \mid D_0) = \mathcal{N}\left(D_\tau;\ \sqrt{\bar\alpha_\tau}\, D_0,\ (1-\bar\alpha_\tau) I\right), \quad \bar\alpha_\tau = \prod_{n=1}^{\tau} (1-\beta_n)

In classical diffusion models, D_0 corresponds to true data (here, the depth map). MonoDiffusion, however, initiates inference from D_T and applies a deterministic reverse (denoising) process parameterized as

p_\theta(D_{\tau-1} \mid D_\tau, c) = \mathcal{N}\left(D_{\tau-1};\ \mu_\theta(D_\tau, \tau, c),\ \sigma_\tau^2 I\right)

with the estimate

D_{\tau-1} = \mu_\theta(D_\tau, \tau, c)

adopting the DDIM sampler [Song et al., ICLR’21], which is made deterministic by setting \sigma_\tau = 0. Each reverse step refines the estimate toward a coherent depth map, conditioned on visual features c.
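The deterministic reverse update above can be sketched numerically. The following is an illustrative NumPy sketch, not the paper’s implementation; `eps` stands in for the network’s noise prediction \epsilon_\theta(D_\tau, \tau, c), and the step first recovers a clean-depth estimate before re-noising it to the previous noise level.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product: alpha_bar_tau = prod_{n<=tau} (1 - beta_n)."""
    return np.cumprod(1.0 - betas)

def ddim_step(d_tau, eps, alpha_bar_tau, alpha_bar_prev):
    """One deterministic DDIM reverse step D_tau -> D_{tau-1} (sigma_tau = 0).

    d_tau: current noisy depth estimate; eps: predicted noise at step tau.
    """
    # Estimate the clean depth D_0 implied by the current noisy sample.
    d0_hat = (d_tau - np.sqrt(1.0 - alpha_bar_tau) * eps) / np.sqrt(alpha_bar_tau)
    # Re-noise that estimate to the previous (less noisy) diffusion level.
    return np.sqrt(alpha_bar_prev) * d0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps
```

If the predicted noise were exact, a step taken with alpha_bar_prev = 1 would recover D_0 in closed form, which is why the sampler can traverse the schedule in only a handful of steps.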

2. Pseudo Ground-Truth Diffusion via Self-Supervised Teacher

True depth labels are unavailable in self-supervised settings. MonoDiffusion circumvents this by training a “teacher” depth network, based on Lite-Mono [Zhang et al., CVPR’23], using standard photometric self-supervision. The teacher’s output D_{\rm pseudo}(p) is then treated as a stand-in for the ground truth. MonoDiffusion defines a pseudo-diffusion forward process

q(D_\tau \mid D_{\rm pseudo}) = \mathcal{N}\left(D_\tau;\ \sqrt{\bar\alpha_\tau}\, D_{\rm pseudo},\ (1-\bar\alpha_\tau) I\right)

During student training, noisy pseudo depth maps D_\tau are sampled from this distribution, and the network learns the reverse denoising process to reconstruct D_{\rm pseudo}, leveraging the teacher’s predictions as self-supervised targets.
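Sampling from this pseudo-diffusion forward process reduces to the standard closed-form reparameterization. A minimal sketch, assuming the teacher’s depth map is given as an array:

```python
import numpy as np

def q_sample_pseudo(d_pseudo, alpha_bar_tau, rng):
    """Sample D_tau ~ N(sqrt(alpha_bar) * D_pseudo, (1 - alpha_bar) * I).

    Returns both the noisy map and the Gaussian noise, since the noise is
    the regression target for the student's noise predictor.
    """
    noise = rng.standard_normal(d_pseudo.shape)
    d_tau = np.sqrt(alpha_bar_tau) * d_pseudo + np.sqrt(1.0 - alpha_bar_tau) * noise
    return d_tau, noise
```

At \bar\alpha_\tau = 1 (no noise) the sample equals the teacher map; as \bar\alpha_\tau \to 0 it approaches pure noise, matching the initialization D_T used at inference.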

3. Masked Visual Condition Mechanism

At each diffusion step, the MonoDiffusion student network is conditioned on multi-scale image feature tokens \{E_i\} extracted from an encoder. To foster robust denoising and reliance on critical image context, a binary mask M_i randomly zeros out r\% of tokens at each scale during training (empirically, r = 20\%). The resulting masked feature tokens are processed via 3\times3 convolution and upsample condition modules, producing a hierarchy of masked visual conditions c_{\rm m}. Complete (unmasked) features yield the full visual condition c. The noise prediction module \epsilon_\theta(D_\tau, \tau, c, c_{\rm m}) must estimate clean depth even with incomplete context. Training applies a reconstruction loss

\mathcal{L}_{rec} = \sum_p \left| \widehat D^t(p) - D^t(p) \right|

where \widehat D^t(p) and D^t(p) are the masked and unmasked predictions, respectively; this enforces invariance to partial context masking.
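The random token masking itself is a small operation. A hedged sketch, assuming tokens are stored as an [N, C] array (one row per spatial token); the function name and layout are illustrative, not from the paper:

```python
import numpy as np

def mask_tokens(tokens, ratio=0.2, rng=None):
    """Zero out a random `ratio` of feature tokens (tokens: [N, C] array).

    Mirrors the masked-visual-condition idea: the downstream denoiser must
    cope with the zeroed rows. The input array is left untouched.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = tokens.shape[0]
    k = int(round(ratio * n))
    idx = rng.choice(n, size=k, replace=False)  # tokens to drop, no repeats
    masked = tokens.copy()
    masked[idx] = 0.0
    return masked
```

At inference no masking is applied; only the training-time reconstruction loss ties the masked and unmasked predictions together.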

4. Loss Functions and Training Objective

MonoDiffusion adopts a multi-term loss to balance self-supervised depth learning, knowledge distillation, and diffusion objectives. The total loss is

\mathcal{L} = \sum_p \left[ \mathcal{L}_{ph} + \mathcal{L}_{KD} + 0.1\,\mathcal{L}_{rec} + \mathcal{L}_{ddim} \right]

where:

  • \mathcal{L}_{ph}: photometric self-supervision (SSIM and L1 reprojection, \lambda_1 = 1)
  • \mathcal{L}_{KD}: knowledge distillation, defined as

\mathcal{L}_{KD} = \sum_p \Phi(p)\, \left| D^t(p) - D_{\rm pseudo}(p) \right|

using a validity mask \Phi(p) to exclude unreliable teacher estimates

  • \mathcal{L}_{rec}: masked-condition reconstruction loss, weighted with \lambda_3 = 0.1
  • \mathcal{L}_{ddim}: DDIM denoising loss, \|\epsilon - \epsilon_\theta(D_\tau, \tau, c)\|^2, with \lambda_4 = 1

\mathcal{L}_{KD} employs multi-view consistency checks [Liu et al., TCSVT’23] to generate \Phi(p), removing pixels with inconsistent depth across multiple viewpoints.
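The masked distillation term and the overall weighting can be sketched directly from the definitions above. A minimal NumPy illustration, assuming per-image scalar values for the other loss terms (the helper names are ours, not the paper’s):

```python
import numpy as np

def kd_loss(d_student, d_pseudo, validity):
    """L1 distillation restricted to pixels the validity mask Phi keeps (1/0)."""
    return float(np.sum(validity * np.abs(d_student - d_pseudo)))

def total_loss(l_ph, l_kd, l_rec, l_ddim, w_rec=0.1):
    """Weighted sum matching L = L_ph + L_KD + 0.1 * L_rec + L_ddim."""
    return l_ph + l_kd + w_rec * l_rec + l_ddim
```

Pixels flagged as inconsistent across viewpoints simply contribute zero to the distillation term, so a noisy teacher degrades gracefully rather than corrupting the student.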

5. Empirical Evaluation and Results

MonoDiffusion has been evaluated on canonical monocular depth estimation benchmarks:

| Dataset | Frames (Train/Val/Test) | Metrics | MonoDiffusion (3.1M) | Lite-Mono |
| --- | --- | --- | --- | --- |
| KITTI Eigen-split | 39,180 / 4,424 / 697 | Abs Rel / Sq Rel / RMSE / RMSE log / \delta < 1.25 | 0.103 / 0.726 / 4.447 / 0.179 / 0.893 | 0.107 / 0.765 / 4.561 / 0.183 / 0.886 |
| Make3D (zero-shot) | 134 test | Abs Rel / Sq Rel / RMSE / RMSE log | 0.295 / 2.849 / 6.854 / 0.150 | 0.305 / 3.060 / 6.981 / 0.158 |

Abs Rel, Sq Rel, RMSE, RMSE log: lower is better. \delta < 1.25: higher is better.
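These are the standard monocular depth metrics; a compact sketch of how they are typically computed over valid ground-truth pixels (flattened arrays assumed for brevity):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-estimation metrics over flattened valid pixels.

    pred, gt: positive depth arrays of the same shape.
    """
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)  # symmetric accuracy ratio
    delta1 = np.mean(ratio < 1.25)            # fraction within threshold
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "delta1": delta1}
```

Benchmark protocols typically also apply a depth cap and median scaling before computing these values; those steps are omitted here.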

MonoDiffusion outperforms prior state-of-the-art self-supervised methods, showing notable gains, particularly on boundaries (e.g., poles, road signs), long-range region separation, and the coherence of reconstructed 3D point clouds. Zero-shot transfer to Make3D exhibits robust generalization (Shao et al., 2023).

Ablations reveal critical dependencies: removing the pseudo ground-truth diffusion leads to non-convergence; incrementally adding pseudo-diffusion, distillation, and the masked condition successively strengthens accuracy; and optimal results arise near 20 inference steps, with too few or too many steps degrading accuracy.

6. Analysis, Implications, and Limitations

MonoDiffusion’s integration of diffusion models into self-supervised depth estimation, enabled by pseudo ground-truth mechanisms and context-masked visual conditions, advances the state-of-the-art without ground-truth labels. Distillation using a teacher model introduces dependence on initial teacher quality; unreliable teacher predictions are mitigated via multi-view consistency checks and validity masks. Empirically, the network is sensitive to inference step count and mask ratio, underlining the importance of hyperparameter selection.

A plausible implication is that the conditional diffusion paradigm generalizes to other label-sparse, ill-posed dense prediction tasks—provided pseudo-targets of sufficient fidelity are available. Limitations include failure to converge without high-quality pseudo-diffusion and diminishing returns at higher step counts.

MonoDiffusion demonstrates that bridging generative modeling and self-supervised depth estimation, with auxiliary teacher and condition-masking mechanisms, yields effective and scalable single-image 3D understanding (Shao et al., 2023).
