MonoDiffusion: Diffusion-based Depth Estimation
- MonoDiffusion is a self-supervised framework that reformulates monocular depth estimation as a conditioned diffusion denoising process, eliminating dependence on true depth labels.
- It incorporates pseudo ground truth from a teacher network through a forward diffusion process to guide training in the absence of ground-truth depth.
- A masked visual condition mechanism enhances robustness by randomly masking image features during training, improving depth boundary reconstruction and overall performance.
MonoDiffusion is a self-supervised monocular depth estimation framework that formulates depth regression as a conditional diffusion-based denoising process. Departing from direct depth map prediction, MonoDiffusion iteratively denoises random depth noise guided by image-derived features, compensating for the absence of true depth ground-truth through a pseudo ground-truth diffusion process and a masked visual condition mechanism. This paradigm enables state-of-the-art depth estimation performance on established benchmarks without requiring depth labels at any training stage (Shao et al., 2023).
1. Depth Estimation as a Conditional Diffusion Process
MonoDiffusion recasts monocular depth estimation as a sequence of denoising operations within the generative framework of diffusion models. The standard forward (noising) process is formalized as $q(d_t \mid d_{t-1}) = \mathcal{N}(d_t; \sqrt{1-\beta_t}\, d_{t-1}, \beta_t \mathbf{I})$, where $\{\beta_t\}_{t=1}^{T}$ is a fixed noise schedule and $d_T$ is initialized as random noise. The marginal at step $t$ is $q(d_t \mid d_0) = \mathcal{N}(d_t; \sqrt{\bar\alpha_t}\, d_0, (1-\bar\alpha_t)\mathbf{I})$, with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$.
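The forward (noising) process above can be sketched in a few lines of NumPy; the linear schedule bounds and the helper names (`make_schedule`, `q_sample`) are illustrative choices, not the paper's implementation:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bar[t] accumulates prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(d0, t, alpha_bar, rng):
    """Draw d_t ~ q(d_t | d_0) = N(sqrt(abar_t) d_0, (1 - abar_t) I)."""
    noise = rng.standard_normal(d0.shape)
    d_t = np.sqrt(alpha_bar[t]) * d0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return d_t, noise
```

At large $t$, $\bar\alpha_t$ approaches zero, so $d_t$ is dominated by the injected noise, matching the initialization of inference from pure noise.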
In classical diffusion models, $d_0$ corresponds to true data (here, the depth map). MonoDiffusion, however, initiates inference from $d_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and applies a deterministic reverse (denoising) process parameterized as $d_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat d_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\, \epsilon_\theta(d_t, t, c) + \sigma_t \epsilon$, with the estimate $\hat d_0 = \bigl(d_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(d_t, t, c)\bigr)/\sqrt{\bar\alpha_t}$,
adopting the DDIM sampler [Song et al., ICLR’21] for determinism by setting $\sigma_t = 0$. Each reverse step refines the estimate toward a coherent depth map, conditioned on visual features $c$.
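One deterministic DDIM update, as described above, can be sketched as follows (a minimal NumPy sketch; `eps_pred` stands in for the conditioned noise-prediction network, which is not reproduced here):

```python
import numpy as np

def ddim_step(d_t, eps_pred, t, t_prev, alpha_bar):
    """One deterministic DDIM update (sigma_t = 0).

    eps_pred is the network's noise estimate eps_theta(d_t, t, c);
    t_prev = -1 denotes the final jump to the clean estimate.
    """
    abar_t = alpha_bar[t]
    abar_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Estimate of the clean depth map implied by the current noisy sample.
    d0_hat = (d_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Deterministic move toward step t_prev along the predicted noise direction.
    return np.sqrt(abar_prev) * d0_hat + np.sqrt(1.0 - abar_prev) * eps_pred
```

A sanity check on the update: if `eps_pred` equals the exact noise used to corrupt a clean map, the final step recovers that map exactly.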
2. Pseudo Ground-Truth Diffusion via Self-Supervised Teacher
True depth labels are unavailable in self-supervised settings. MonoDiffusion circumvents this by training a “teacher” depth network, based on Lite-Mono [Zhang et al., CVPR’23], using standard photometric self-supervision. The teacher’s output is then treated as a stand-in for ground-truth. MonoDiffusion defines a pseudo-diffusion forward process $q(d_t \mid d_0^{pse}) = \mathcal{N}(d_t; \sqrt{\bar\alpha_t}\, d_0^{pse}, (1-\bar\alpha_t)\mathbf{I})$, where $d_0^{pse}$ is the teacher’s prediction. During student training, noisy pseudo depth maps $d_t$ are sampled from this distribution, and the network learns the reverse denoising process to reconstruct $d_0^{pse}$, leveraging the teacher’s predictions as self-supervised targets.
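A single student training step under the pseudo-diffusion scheme can be sketched as below; `predict_noise` is a placeholder for the conditioned denoising network, and the L1 noise-matching objective is an illustrative choice rather than the paper's exact formulation:

```python
import numpy as np

def pseudo_diffusion_training_step(d_pse, predict_noise, alpha_bar, rng):
    """One student step on a teacher pseudo depth map d_pse.

    predict_noise(d_t, t) stands in for the conditioned denoising
    network; it is a placeholder for illustration only.
    """
    T = len(alpha_bar)
    t = int(rng.integers(0, T))                  # random diffusion step
    noise = rng.standard_normal(d_pse.shape)     # forward-process noise
    d_t = np.sqrt(alpha_bar[t]) * d_pse + np.sqrt(1.0 - alpha_bar[t]) * noise
    eps_hat = predict_noise(d_t, t)              # student's noise estimate
    return np.mean(np.abs(eps_hat - noise))      # L1 denoising objective
```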
3. Masked Visual Condition Mechanism
At each diffusion step, the MonoDiffusion student network is conditioned on multi-scale image feature tokens extracted from an encoder. To foster robust denoising and reliance on critical image context, a binary mask randomly zeros out a fraction of the tokens at each scale during training. The resulting masked feature tokens are processed via convolution and upsample condition modules, producing a hierarchy of masked-visual conditions $c^{mask}$. Complete (unmasked) features yield the full-visual condition $c^{full}$. The noise prediction module must estimate clean depth even with incomplete context. Training applies a reconstruction loss $\mathcal{L}_{rec} = \lVert \hat d^{mask} - \hat d^{full} \rVert_1$, where $\hat d^{mask}$ and $\hat d^{full}$ are the masked and unmasked predictions, respectively; this enforces invariance to partial context masking.
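The token-masking step can be sketched as a small NumPy routine; the function name and the flat `(N, C)` token layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mask_tokens(features, ratio, rng):
    """Randomly zero out a fraction `ratio` of feature tokens.

    features: (N, C) array of N tokens at one scale. Returns a masked
    copy; the untouched input feeds the full-visual condition branch.
    """
    n = features.shape[0]
    num_masked = int(round(ratio * n))
    idx = rng.choice(n, size=num_masked, replace=False)
    masked = features.copy()
    masked[idx] = 0.0
    return masked
```

In a multi-scale setup, this would be applied independently to the token set at each encoder scale.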
4. Loss Functions and Training Objective
MonoDiffusion adopts a multi-term loss function to balance self-supervised depth learning, knowledge distillation, and diffusion objectives. The total loss is $\mathcal{L} = \mathcal{L}_{ph} + \lambda_{dis}\mathcal{L}_{dis} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{ddim}\mathcal{L}_{ddim}$, where:
- $\mathcal{L}_{ph}$: photometric self-supervision (weighted SSIM and L1 reprojection terms)
- $\mathcal{L}_{dis}$: knowledge distillation, defined as $\mathcal{L}_{dis} = \lVert M_v \odot (\hat d - d^{pse}) \rVert_1$, using validity mask $M_v$ to exclude unreliable teacher estimates
- $\mathcal{L}_{rec}$: masked-condition reconstruction loss, weighted by $\lambda_{rec}$
- $\mathcal{L}_{ddim}$: DDIM denoising loss, penalizing the discrepancy between the predicted and injected noise, weighted by $\lambda_{ddim}$
The validity mask $M_v$ is generated via multi-view consistency checks [Liu et al., TCSVT’23], removing pixels with inconsistent depth across multiple viewpoints.
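The composite objective can be sketched as follows; the default weight values and the exact normalization of the distillation term are illustrative assumptions, not the paper's coefficients:

```python
import numpy as np

def total_loss(l_ph, l_dis, l_rec, l_ddim, w_dis=1.0, w_rec=1.0, w_ddim=1.0):
    """Weighted sum of the four objectives. Weight values here are
    placeholders; the paper's actual coefficients are not restated
    in this summary."""
    return l_ph + w_dis * l_dis + w_rec * l_rec + w_ddim * l_ddim

def distillation_loss(d_student, d_teacher, valid_mask):
    """L1 distillation restricted to pixels the validity mask keeps."""
    diff = np.abs(d_student - d_teacher) * valid_mask
    return diff.sum() / max(valid_mask.sum(), 1)
```

Masking the distillation term means pixels flagged as inconsistent across viewpoints contribute nothing, so a partially unreliable teacher does not corrupt the student.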
5. Empirical Evaluation and Results
MonoDiffusion has been evaluated on canonical monocular depth estimation benchmarks:
| Dataset | Frames (Train/Val/Test) | Metrics | MonoDiffusion (3.1M) | Lite-Mono |
|---|---|---|---|---|
| KITTI Eigen-split | 39,180/4,424/697 | Abs Rel / Sq Rel / RMSE / RMSE log / δ<1.25 | 0.103 / 0.726 / 4.447 / 0.179 / 0.893 | 0.107 / 0.765 / 4.561 / 0.183 / 0.886 |
| Make3D (zero-shot) | 134 test | Abs Rel / Sq Rel / RMSE / RMSE log | 0.295 / 2.849 / 6.854 / 0.150 | 0.305 / 3.060 / 6.981 / 0.158 |
Abs Rel, Sq Rel, RMSE, RMSE log: lower is better. δ<1.25: higher is better.
MonoDiffusion outperforms prior state-of-the-art self-supervised methods, showing notable gains, particularly on boundaries (e.g., poles, road signs), long-range region separation, and the coherence of reconstructed 3D point clouds. Zero-shot transfer to Make3D exhibits robust generalization (Shao et al., 2023).
Ablation studies reveal critical dependencies: removing the pseudo ground-truth diffusion leads to non-convergence; incremental addition of pseudo-diffusion, distillation, and masked-condition components successively strengthens accuracy; and optimal results arise near 20 inference steps, with fewer or more steps degrading accuracy.
6. Analysis, Implications, and Limitations
MonoDiffusion’s integration of diffusion models into self-supervised depth estimation, enabled by pseudo ground-truth mechanisms and context-masked visual conditions, advances the state-of-the-art without ground-truth labels. Distillation using a teacher model introduces dependence on initial teacher quality; unreliable teacher predictions are mitigated via multi-view consistency checks and validity masks. Empirically, the network is sensitive to inference step count and mask ratio, underlining the importance of hyperparameter selection.
A plausible implication is that the conditional diffusion paradigm generalizes to other label-sparse, ill-posed dense prediction tasks—provided pseudo-targets of sufficient fidelity are available. Limitations include failure to converge without high-quality pseudo-diffusion and diminishing returns at higher step counts.
MonoDiffusion demonstrates that bridging generative modeling and self-supervised depth estimation, with auxiliary teacher and condition-masking mechanisms, yields effective and scalable single-image 3D understanding (Shao et al., 2023).