
Masked Depth Modeling (MDM)

Updated 28 January 2026
  • Masked Depth Modeling (MDM) is a technique that formulates depth estimation as a masked token recovery task by reconstructing missing depth using contextual cues.
  • It employs architectures like Vision Transformers and CNNs to fuse RGB and depth inputs, achieving robust sparse-to-dense completion even under high masking ratios.
  • MDM uses self-supervised reconstruction losses over valid pixels, yielding strong geometric priors that enhance spatial perception and support downstream tasks.

Masked Depth Modeling (MDM) refers to a family of techniques in which depth estimation or depth completion is formulated as a “masked token recovery” or “inpainting” problem. In these methods, portions of the available depth signal—whether naturally missing (e.g., due to sensor limitations) or intentionally occluded via random synthetic masking—are withheld from the model. The task is then to reconstruct the original depth values in the masked regions by leveraging context from observed pixels and, in many settings, additional RGB or multi-modal inputs. MDM provides a self-supervised or semi-supervised signal, enables robust sparse-to-dense completion even under extreme observation loss, and yields strong geometric priors that transfer across domains and downstream spatial perception tasks.

1. Mathematical Foundations and Core Principles

Let $I \in \mathbb{R}^{H \times W \times 3}$ denote a color (RGB) image, $D \in \mathbb{R}^{H \times W}$ the aligned “raw” or “sparse” depth map, and $M \in \{0,1\}^{H \times W}$ a binary mask with $M(p) = 1$ for masked pixels. MDM strategies define the observed input as the pair $(I, D_\text{masked})$, where $D_\text{masked} = D \odot (1 - M) + b \odot M$ and $b$ is either a numeric placeholder or a learned mask embedding. The modeling objective is to predict a dense (completed) depth map $\hat D = F(I, D_\text{masked})$ such that $\hat D$ is close to the ground truth $D$ at all relevant pixels.
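
The masked-input construction above can be sketched in a few lines of numpy. The function name and the scalar placeholder are illustrative; token-based models would substitute a learned mask embedding for $b$:

```python
import numpy as np

def make_masked_input(depth, mask, placeholder=0.0):
    """Withhold depth at masked pixels: D_masked = D * (1 - M) + b * M.

    depth:       (H, W) raw/sparse depth map D
    mask:        (H, W) binary mask M, with 1 marking withheld pixels
    placeholder: numeric stand-in b (a learned mask embedding would be
                 used instead in token-based models)
    """
    return depth * (1 - mask) + placeholder * mask

# Example: withhold the lower-right pixel of a 2x2 depth map.
D = np.array([[1.0, 2.0],
              [3.0, 4.0]])
M = np.array([[0, 0],
              [0, 1]])
D_masked = make_masked_input(D, M, placeholder=-1.0)
```

The model then only ever sees `D_masked` (plus the full RGB image) and must recover the withheld values from context.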

The canonical MDM loss is a reconstruction term, often $L_1$ or MSE, restricted to ground-truth valid locations:

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{\sum_p G(p)} \sum_{p=1}^{H \times W} G(p)\, \left| D(p) - \hat D(p) \right|$$

where $G$ is the indicator of pixels with valid ground truth. No additional geometric priors (e.g., smoothness) are required for effective learning in large-scale settings (Tan et al., 25 Jan 2026). High masking ratios (up to 60–90%) are typical, forcing models to rely on global contextual reasoning (Sun et al., 2024, Tan et al., 25 Jan 2026, Yan et al., 2022).
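
A minimal numpy sketch of this valid-pixel-restricted $L_1$ reconstruction term (function and variable names are illustrative):

```python
import numpy as np

def masked_l1_loss(depth_gt, depth_pred, valid):
    """L_rec = (1 / sum_p G(p)) * sum_p G(p) * |D(p) - D_hat(p)|,
    where G (here `valid`) marks pixels with valid ground truth."""
    valid = valid.astype(bool)
    n = valid.sum()
    if n == 0:
        return 0.0                       # no supervision at this step
    return float(np.abs(depth_gt - depth_pred)[valid].sum() / n)

# Example: one of four valid pixels is off by 2, so the loss is 2/4.
D_gt  = np.array([[1.0, 2.0], [3.0, 4.0]])
D_hat = np.array([[1.0, 2.0], [3.0, 6.0]])
G     = np.ones((2, 2))
loss = masked_l1_loss(D_gt, D_hat, G)    # 0.5
```

Restricting the sum to $G(p) = 1$ is what lets the same objective handle both synthetic masking and naturally incomplete sensor depth.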

2. Architectures: Transformers, Convolutions, and Fusion Mechanisms

MDM systems employ various architectures tailored to the masking and reconstruction paradigm:

  • Vision Transformers (ViT): In large-scale MDM (LingBot-Depth (Tan et al., 25 Jan 2026), indoor depth completion (Sun et al., 2024)), ViTs ingest patch-embedded RGB-D tokens. Masking is performed at the token (patch) level, removing up to 75% of depth patches while preserving full RGB context. Depth and/or RGB tokens are fused with cross-modal attention to align geometric and visual semantic features.
  • CNN and Mask-Gated Convolution: Models such as MagaConv (Huang et al., 2024) implement convolution kernels modulated by a mask-driven gating function, ensuring that feature extraction respects the underlying missing data pattern. Iterative mask updates enable progressive “growing” of valid depth support.
  • Token Fusion Decoders: For RGB-D fusion after pre-training, token-level addition or fusion MLPs are applied to combine ViT-encoded features with raw depth projections (Sun et al., 2024).
  • U-Net and Guided Convolutions: In panoramic and multi-modal MDM (M³PT (Yan et al., 2022)), backbone architectures can be standard encoder-decoders (GuideNet, UniFuse, HoHoNet), with fusion mechanisms such as spatially-adaptive convolutions guided by the RGB stream.
  • Temporal Transformers: Video MDMs employ spatial-temporal transformers to inpaint masked frames using information from temporally adjacent, unmasked frames. High masking ratios (e.g., 83%) force the learning of robust inter-frame consistency (Wang et al., 2022).
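
As a concrete illustration of the first bullet, token-level masking in a ViT-based MDM amounts to dropping a random subset of depth patch embeddings while keeping all RGB tokens. The sketch below assumes patch embeddings are already computed; names, dimensions, and the 75% ratio are illustrative:

```python
import numpy as np

def mask_depth_tokens(depth_tokens, mask_ratio=0.75, rng=None):
    """Randomly withhold a fraction of depth patch tokens (RGB kept).

    depth_tokens: (N, C) array of N patch embeddings
    Returns kept tokens plus kept/masked indices; a decoder would
    re-insert a learned mask embedding at the masked positions.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = depth_tokens.shape[0]
    n_keep = max(1, int(round(n * (1 - mask_ratio))))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask_idx = np.sort(perm[n_keep:])
    return depth_tokens[keep_idx], keep_idx, mask_idx

tokens = np.zeros((16, 8))   # 16 depth patches, 8-dim embeddings
kept, keep_idx, mask_idx = mask_depth_tokens(tokens, mask_ratio=0.75)
```

Because only the kept tokens enter the encoder, high mask ratios also cut pre-training compute, which is part of why such aggressive masking is practical at scale.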

3. Masking Strategies and Pre-Training Protocols

MDM leverages both natural and synthetic masking:

  • Natural Masking: Directly reflects hardware failures—pixels where the sensor fails to return depth or where geometry is fundamentally ambiguous (specular, transparent, distant surfaces) (Tan et al., 25 Jan 2026).
  • Synthetic Masking: Introduced patch-wise or pixel-wise. Strategies include random masking (75–90%), block-wise (structured) masking, or operational masking specifically targeted at the regions most relevant for the downstream task (Sun et al., 2024, Yan et al., 2022).
  • Disjoint Patch Partitioning: In semi-supervised schemes (MaskingDepth (Baek et al., 2022)), strongly-augmented branches apply $K$-way disjoint patch masking to ensure global scale consistency and robust feature learning.
  • Temporal Masking: In video models, frames are randomly masked, sometimes leaving only two visible per sequence, enforcing temporal inpainting (Wang et al., 2022).

During pre-training, the network processes masked inputs and learns to inpaint or reconstruct missing depth only at masked (and valid) locations, often with no change of architecture at fine-tuning—every weight is exposed to both masked and unmasked signals (Yan et al., 2022, Sun et al., 2024).
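
Combining synthetic masking with the natural validity mask, a pre-training step supervises reconstruction only where a pixel was both synthetically withheld and has valid ground truth. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def pretrain_loss(depth_gt, depth_pred, mask, valid):
    """MDM pre-training objective: score the prediction only at pixels
    that were synthetically masked AND carry valid ground truth."""
    target = mask.astype(bool) & valid.astype(bool)
    n = target.sum()
    if n == 0:
        return 0.0
    return float(np.abs(depth_gt - depth_pred)[target].mean())

# Two pixels are masked; only one of them is valid and correct,
# the other is valid and off by 1, giving a mean error of 0.5.
D_gt  = np.array([[2.0, 4.0], [6.0, 8.0]])
D_hat = np.array([[3.0, 4.0], [6.0, 9.0]])
M     = np.array([[1, 0], [1, 0]])   # synthetically masked pixels
G     = np.array([[1, 1], [1, 0]])   # sensor-valid ground truth
loss = pretrain_loss(D_gt, D_hat, M, G)
```

Note that pixel (1, 1) contributes nothing even though the prediction there is wrong, because the sensor never provided a ground-truth value to score against.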

4. Loss Functions, Optimization, and Theoretical Equivalences

MDM typically uses task-aligned self-supervised objectives:

  • Masked Reconstruction Loss: $L_1$ or MSE losses on valid, masked locations; no perceptual or adversarial terms are required at scale (Tan et al., 25 Jan 2026, Sun et al., 2024, Yan et al., 2022).
  • Additional Consistency Losses: For monocular or semi-supervised MDM, feature- and pseudo-label–consistency losses regularize the network under strong masking (Baek et al., 2022).
  • Hybrid AO-AR Objectives: In the language modeling analog of MDM, a mixture of left-to-right and randomly permuted orders for masked token prediction is employed to balance convergence and generalization (Xue et al., 24 Jun 2025).
  • Block-Wise and Progressive Masking Schedules: For large language MDMs ($T^\star$ (Xia et al., 16 Jan 2026)), reinforcement learning schedules progressively increase block size to enable highly parallel decoding with minimal reasoning loss.
  • Theoretical Equivalence: The MDM/Any-Order AR (AO-AR) objectives are formally equivalent to diffusion-style random masking followed by masked-reconstruction ELBO optimization, unifying masked LLMs, masked autoencoders, and masked depth inpainting under a common probabilistic frame (Xue et al., 24 Jun 2025).
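
For the consistency-loss bullet, one simple way to realize uncertainty weighting is to down-weight disagreement at pixels the pseudo-label branch is unsure about. The exponential weighting below is an illustrative choice, not the specific scheme of any cited paper:

```python
import numpy as np

def consistency_loss(pred_weak, pred_strong, uncertainty):
    """Uncertainty-weighted consistency between a weakly augmented
    (pseudo-label) branch and a strongly masked branch. Pixels with
    high predicted uncertainty contribute less to the loss."""
    w = np.exp(-uncertainty)             # illustrative weighting
    return float((w * np.abs(pred_weak - pred_strong)).sum() / w.sum())

pw = np.array([1.0, 1.0])                # pseudo-labels
ps = np.array([2.0, 1.0])                # strongly-masked predictions
confident = consistency_loss(pw, ps, np.zeros(2))          # 0.5
hedged    = consistency_loss(pw, ps, np.array([2.0, 0.0]))  # < 0.5
```

Raising the uncertainty on the disagreeing pixel shrinks its contribution, so noisy pseudo-labels do not dominate training.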

5. Empirical Performance, Benchmarks, and Ablations

MDM reproducibly yields state-of-the-art or competitive results in diverse depth estimation settings:

  • RGB-D Depth Completion: On Matterport3D, ViT-based MDM achieves RMSE = 0.690 m, ME = 0.206 m, SSIM = 0.765, and $\delta_{1.25} = 0.852$ (Sun et al., 2024). Ablations show pre-training and deep encoders are key.
  • Panoramic and Sparse-Depth Completion: M³PT achieves a 29–54% reduction in RMSE and MAE over baselines on Matterport3D, Stanford2D3D, and 3D60, without any architectural change between pre-training and fine-tuning (Yan et al., 2022).
  • Video Consistency: In FMNet, a masked-frame MDM delivers a 47% reduction in temporal flicker (OPW) over prior art, simultaneously maintaining spatial accuracy (Wang et al., 2022).
  • Semi-supervised and Domain Adaptation: MaskingDepth closes the supervised-unlabeled gap (e.g., on NYU-Depth-v2, yielding AbsRel = 0.104, RMSE = 0.372, $\delta_{<1.25} = 0.904$ with large unlabeled supplementation) (Baek et al., 2022).
  • Robotics and Downstream Alignment: LingBot-Depth’s MDM backbones halve point-cloud error in 3D reconstruction and improve 3D grasp success rates for challenging objects compared to raw depth input (Tan et al., 25 Jan 2026).

Empirical studies confirm that high mask ratios, joint RGB-depth fusion, and restricted loss to valid pixels enable robust spatial priors and cross-modal alignment, while progressive curriculum or hybrid-ordering significantly mitigates convergence bottlenecks in sequence modeling (Sun et al., 2024, Xue et al., 24 Jun 2025).

6. Variants and Extensions Across Modalities and Paradigms

MDM spans multiple modalities and paradigms:

  • Multi-modal Masked Training: Simultaneous random masking of both RGB and depth channels (as in M³PT) harnesses multi-modal redundancy and amplifies representational strength (Yan et al., 2022).
  • Object-Masked Counterfactuals: Some MDM variants accept user- or detector-supplied object masks, reconstructing “counterfactual” depth for arbitrary removals, including out-of-distribution objects (Issaranon et al., 2019).
  • Temporal and Spatial Masked Modeling: Both patch-level (spatial) and frame-level (temporal) masking yield robust filling for spatially large missing regions or temporally inconsistent sequences (Wang et al., 2022).
  • Semi-supervised and Uncertainty-weighted Consistency: Disjoint patch masking and prediction–consistency regularization, weighted by predicted uncertainty, allow MDMs to fully exploit unlabeled or domain-shifted data (Baek et al., 2022).
  • Masked Diffusion in Language: In masked diffusion LLMs, random block-wise masking and progressive schedule learning trade off between AR accuracy and high-parallelism generation (Xia et al., 16 Jan 2026, Xue et al., 24 Jun 2025).

7. Significance, Limitations, and Conceptual Impact

MDM unifies previously siloed approaches to completion, inpainting, and self-supervised representation learning, establishing masked depth as (i) an effective self-supervised signal, (ii) a practical vehicle for efficient large-scale pre-training, and (iii) a modular primitive compatible with transformers, convolutions, and multi-modal fusions (Tan et al., 25 Jan 2026, Sun et al., 2024, Yan et al., 2022, Xue et al., 24 Jun 2025).

Limitations include possible sensitivity to mask generation policies, error propagation in extremely sparse observation regimes, and, in high-resolution tokenized MDM, compute and memory scaling. Disjoint masking strategies and hybrid scheduling soften scale ambiguity and facilitate convergence. Unlike simple zero-imputation, adaptive masked modeling respects the structure of missingness and enforces latent-space regularization, yielding robust cross-task geometric priors.

Across spatial, temporal, and semantic domains, MDM has demonstrated strong transfer not only in vision (RGB-D inpainting, scene completion, video depth, panoramic reconstruction) but also in autoregressive and encoder-decoder text paradigms. The flexibility and extensibility of MDM anchor it as a foundational paradigm in spatial perception, autonomous sensing, and next-generation data-driven 3D reasoning (Tan et al., 25 Jan 2026, Sun et al., 2024, Xue et al., 24 Jun 2025).
