MonoDiffusion: Diffusion-based Depth Estimation
- MonoDiffusion is a self-supervised framework that reformulates monocular depth estimation as a conditioned diffusion denoising process, eliminating dependence on true depth labels.
- It incorporates pseudo ground truth from a teacher network through a forward diffusion process to guide training in the absence of ground-truth depth.
- A masked visual condition mechanism enhances robustness by randomly masking image features during training, improving depth boundary reconstruction and overall performance.
MonoDiffusion is a self-supervised monocular depth estimation framework that formulates depth regression as a conditional diffusion-based denoising process. Departing from direct depth map prediction, MonoDiffusion iteratively denoises random depth noise guided by image-derived features, compensating for the absence of true depth ground-truth through a pseudo ground-truth diffusion process and a masked visual condition mechanism. This paradigm enables state-of-the-art depth estimation performance on established benchmarks without requiring depth labels at any training stage (Shao et al., 2023).
1. Depth Estimation as a Conditional Diffusion Process
MonoDiffusion recasts monocular depth estimation as a sequence of denoising operations within the generative framework of diffusion models. The standard forward (noising) process is formalized as $q(d_t \mid d_{t-1}) = \mathcal{N}(d_t; \sqrt{1-\beta_t}\, d_{t-1}, \beta_t \mathbf{I})$, where $\{\beta_t\}_{t=1}^{T}$ is a fixed noise schedule and $d_T$ is initialized as random noise. The marginal at step $t$ is $q(d_t \mid d_0) = \mathcal{N}(d_t; \sqrt{\bar\alpha_t}\, d_0, (1-\bar\alpha_t)\mathbf{I})$, with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$.
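The forward (noising) process above can be sketched in a few lines of NumPy; the linear schedule bounds and the helper names (`make_schedule`, `q_sample`) are illustrative choices, not the paper's implementation:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bar[t] accumulates prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(d0, t, alpha_bar, rng):
    """Draw d_t ~ q(d_t | d_0) = N(sqrt(abar_t) d_0, (1 - abar_t) I)."""
    noise = rng.standard_normal(d0.shape)
    d_t = np.sqrt(alpha_bar[t]) * d0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return d_t, noise
```

At large $t$, $\bar\alpha_t$ approaches zero, so $d_t$ is dominated by the injected noise, matching the initialization of inference from pure noise.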
In classical diffusion models, $d_0$ corresponds to true data (here, the depth map). MonoDiffusion, however, initiates inference from $d_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and applies a deterministic reverse (denoising) process parameterized as $d_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat d_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\, \epsilon_\theta(d_t, t, c) + \sigma_t \epsilon$, with the estimate $\hat d_0 = \bigl(d_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(d_t, t, c)\bigr)/\sqrt{\bar\alpha_t}$,
adopting the DDIM sampler [Song et al., ICLR’21] for determinism by setting $\sigma_t = 0$. Each reverse step refines the estimate toward a coherent depth map, conditioned on visual features $c$.
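One deterministic DDIM update, as described above, can be sketched as follows (a minimal NumPy sketch; `eps_pred` stands in for the conditioned noise-prediction network, which is not reproduced here):

```python
import numpy as np

def ddim_step(d_t, eps_pred, t, t_prev, alpha_bar):
    """One deterministic DDIM update (sigma_t = 0).

    eps_pred is the network's noise estimate eps_theta(d_t, t, c);
    t_prev = -1 denotes the final jump to the clean estimate.
    """
    abar_t = alpha_bar[t]
    abar_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Estimate of the clean depth map implied by the current noisy sample.
    d0_hat = (d_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Deterministic move toward step t_prev along the predicted noise direction.
    return np.sqrt(abar_prev) * d0_hat + np.sqrt(1.0 - abar_prev) * eps_pred
```

A sanity check on the update: if `eps_pred` equals the exact noise used to corrupt a clean map, the final step recovers that map exactly.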
2. Pseudo Ground-Truth Diffusion via Self-Supervised Teacher
True depth labels are unavailable in self-supervised settings. MonoDiffusion circumvents this by training a “teacher” depth network, based on Lite-Mono [Zhang et al., CVPR’23], using standard photometric self-supervision. The teacher’s output is then treated as a stand-in for ground-truth. MonoDiffusion defines a pseudo-diffusion forward process $q(d_t \mid d_0^{pse}) = \mathcal{N}(d_t; \sqrt{\bar\alpha_t}\, d_0^{pse}, (1-\bar\alpha_t)\mathbf{I})$, where $d_0^{pse}$ is the teacher’s prediction. During student training, noisy pseudo depth maps $d_t$ are sampled from this distribution, and the network learns the reverse denoising process to reconstruct $d_0^{pse}$, leveraging the teacher’s predictions as self-supervised targets.
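A single student training step under the pseudo-diffusion scheme can be sketched as below; `predict_noise` is a placeholder for the conditioned denoising network, and the L1 noise-matching objective is an illustrative choice rather than the paper's exact formulation:

```python
import numpy as np

def pseudo_diffusion_training_step(d_pse, predict_noise, alpha_bar, rng):
    """One student step on a teacher pseudo depth map d_pse.

    predict_noise(d_t, t) stands in for the conditioned denoising
    network; it is a placeholder for illustration only.
    """
    T = len(alpha_bar)
    t = int(rng.integers(0, T))                  # random diffusion step
    noise = rng.standard_normal(d_pse.shape)     # forward-process noise
    d_t = np.sqrt(alpha_bar[t]) * d_pse + np.sqrt(1.0 - alpha_bar[t]) * noise
    eps_hat = predict_noise(d_t, t)              # student's noise estimate
    return np.mean(np.abs(eps_hat - noise))      # L1 denoising objective
```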
3. Masked Visual Condition Mechanism
At each diffusion step, the MonoDiffusion student network is conditioned on multi-scale image feature tokens extracted from an encoder. To foster robust denoising and reliance on critical image context, a binary mask randomly zeros out a fraction of the tokens at each scale during training. The resulting masked feature tokens are processed via convolution and upsample condition modules, producing a hierarchy of masked-visual conditions $c^{mask}$. Complete (unmasked) features yield the full-visual condition $c^{full}$. The noise prediction module must estimate clean depth even with incomplete context. Training applies a reconstruction loss $\mathcal{L}_{rec} = \lVert \hat d^{mask} - \hat d^{full} \rVert_1$, where $\hat d^{mask}$ and $\hat d^{full}$ are the masked and unmasked predictions, respectively; this enforces invariance to partial context masking.
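The token-masking step can be sketched as a small NumPy routine; the function name and the flat `(N, C)` token layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mask_tokens(features, ratio, rng):
    """Randomly zero out a fraction `ratio` of feature tokens.

    features: (N, C) array of N tokens at one scale. Returns a masked
    copy; the untouched input feeds the full-visual condition branch.
    """
    n = features.shape[0]
    num_masked = int(round(ratio * n))
    idx = rng.choice(n, size=num_masked, replace=False)
    masked = features.copy()
    masked[idx] = 0.0
    return masked
```

In a multi-scale setup, this would be applied independently to the token set at each encoder scale.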
4. Loss Functions and Training Objective
MonoDiffusion adopts a multi-term loss function to balance self-supervised depth learning, knowledge distillation, and diffusion objectives. The total loss is $\mathcal{L} = \mathcal{L}_{ph} + \lambda_{dis}\mathcal{L}_{dis} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{ddim}\mathcal{L}_{ddim}$, where:
- $\mathcal{L}_{ph}$: photometric self-supervision (weighted SSIM and L1 reprojection terms)
- $\mathcal{L}_{dis}$: knowledge distillation, defined as $\mathcal{L}_{dis} = \lVert M_v \odot (\hat d - d^{pse}) \rVert_1$, using validity mask $M_v$ to exclude unreliable teacher estimates
- $\mathcal{L}_{rec}$: masked-condition reconstruction loss, weighted by $\lambda_{rec}$
- $\mathcal{L}_{ddim}$: DDIM denoising loss, penalizing the discrepancy between the predicted and injected noise, weighted by $\lambda_{ddim}$
The validity mask $M_v$ is generated via multi-view consistency checks [Liu et al., TCSVT’23], removing pixels with inconsistent depth across multiple viewpoints.
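The composite objective can be sketched as follows; the default weight values and the exact normalization of the distillation term are illustrative assumptions, not the paper's coefficients:

```python
import numpy as np

def total_loss(l_ph, l_dis, l_rec, l_ddim, w_dis=1.0, w_rec=1.0, w_ddim=1.0):
    """Weighted sum of the four objectives. Weight values here are
    placeholders; the paper's actual coefficients are not restated
    in this summary."""
    return l_ph + w_dis * l_dis + w_rec * l_rec + w_ddim * l_ddim

def distillation_loss(d_student, d_teacher, valid_mask):
    """L1 distillation restricted to pixels the validity mask keeps."""
    diff = np.abs(d_student - d_teacher) * valid_mask
    return diff.sum() / max(valid_mask.sum(), 1)
```

Masking the distillation term means pixels flagged as inconsistent across viewpoints contribute nothing, so a partially unreliable teacher does not corrupt the student.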
5. Empirical Evaluation and Results
MonoDiffusion has been evaluated on canonical monocular depth estimation benchmarks:
| Dataset | Frames (Train/Val/Test) | Metrics | MonoDiffusion (3.1M) | Lite-Mono |
|---|---|---|---|---|
| KITTI Eigen-split | 39,180/4,424/697 | Abs Rel / Sq Rel / RMSE / RMSE log / δ<1.25 | 0.103 / 0.726 / 4.447 / 0.179 / 0.893 | 0.107 / 0.765 / 4.561 / 0.183 / 0.886 |
| Make3D (zero-shot) | 134 test | Abs Rel / Sq Rel / RMSE / RMSE log | 0.295 / 2.849 / 6.854 / 0.150 | 0.305 / 3.060 / 6.981 / 0.158 |
Abs Rel, Sq Rel, RMSE, RMSE log: lower is better. δ<1.25: higher is better.
MonoDiffusion outperforms prior state-of-the-art self-supervised methods, showing notable gains, particularly on boundaries (e.g., poles, road signs), long-range region separation, and the coherence of reconstructed 3D point clouds. Zero-shot transfer to Make3D exhibits robust generalization (Shao et al., 2023).
Ablation studies reveal critical dependencies: removing the pseudo ground-truth diffusion leads to non-convergence; incremental addition of pseudo-diffusion, distillation, and masked-condition components successively strengthens accuracy; and optimal results arise near 20 inference steps, with fewer or more steps degrading accuracy.
6. Analysis, Implications, and Limitations
MonoDiffusion’s integration of diffusion models into self-supervised depth estimation, enabled by pseudo ground-truth mechanisms and context-masked visual conditions, advances the state-of-the-art without ground-truth labels. Distillation using a teacher model introduces dependence on initial teacher quality; unreliable teacher predictions are mitigated via multi-view consistency checks and validity masks. Empirically, the network is sensitive to inference step count and mask ratio, underlining the importance of hyperparameter selection.
A plausible implication is that the conditional diffusion paradigm generalizes to other label-sparse, ill-posed dense prediction tasks—provided pseudo-targets of sufficient fidelity are available. Limitations include failure to converge without high-quality pseudo-diffusion and diminishing returns at higher step counts.
MonoDiffusion demonstrates that bridging generative modeling and self-supervised depth estimation, with auxiliary teacher and condition-masking mechanisms, yields effective and scalable single-image 3D understanding (Shao et al., 2023).