Versatile Depth Estimator (VDE)
- VDE is an innovative framework that separates camera-invariant depth estimation from camera-specific conversion to yield robust metric predictions.
- It decouples a common relative depth estimator (CRDE) from lightweight per-camera converters (R2MCs), enabling efficient generalization across diverse camera models.
- By integrating canonicalization and joint depth-normal optimization, VDE achieves state-of-the-art performance in 3D reconstruction, visual SLAM, and scene understanding.
The Versatile Depth Estimator (VDE) is an architectural and algorithmic framework for monocular metric depth (and optionally surface normal) estimation that achieves robust, state-of-the-art performance across diverse camera models and scenes by decoupling camera-invariant relative depth extraction from camera-specific metric conversion. Unlike traditional approaches that require network retraining for each camera or exhibit significant failure on novel intrinsics, VDE is engineered to adapt flexibly to multiple cameras, achieving efficient multi-camera deployment and rapid metric generalization. Recent extensions embed VDE principles into foundation models capable of zero-shot metric depth and normal prediction using massive, heterogeneous photo datasets and explicit camera-parameter normalization (Jun et al., 2023, Hu et al., 2024).
1. Architectural Principles
VDE is based on two fundamental modules: the Common Relative Depth Estimator (CRDE) and multiple Relative-to-Metric Converters (R2MCs). The CRDE is a large camera-invariant backbone that produces a normalized or affine-invariant depth representation from RGB imagery. Each R2MC is a lightweight, camera-specific head that transforms the CRDE output into absolute, metrically-scaled depth predictions tailored to individual camera intrinsics. This separation enables the model to serve an arbitrary number of camera types by parameterizing only these small converter modules, drastically reducing overhead compared to fully independent models per camera.
The canonical data flow in the original VDE can be written as

$$I \xrightarrow{\text{encoder}} \{E_s\} \xrightarrow{\text{pooling}} z \xrightarrow{\text{FMM decoder}} F_r \longrightarrow \tilde{d} \xrightarrow{\text{R2MC}_k} \hat{d}_k,$$

where $I$ is the input RGB image, $\{E_s\}$ are multi-scale encoder features, $z$ is the bottleneck pooled feature, and $F_r$ is the frequency-mixed relative depth feature. $\tilde{d}$ is a normalized, camera-invariant depth, while $\text{R2MC}_k$ transforms it into a metric estimate $\hat{d}_k$ for camera $k$ (Jun et al., 2023).
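The two-stage flow can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation: `crde` uses image luminance as a placeholder for the learned relative depth, and `r2mc` shows only the affine core of the camera-specific converter; all function names and constants here are illustrative assumptions.

```python
import numpy as np

def crde(image):
    # Camera-invariant stage (stand-in for the Swin encoder + FMM decoder):
    # maps an RGB image to per-pixel relative depth normalized to [0, 1].
    relative = image.mean(axis=-1)  # toy proxy for the learned relative depth
    lo, hi = relative.min(), relative.max()
    return (relative - lo) / (hi - lo + 1e-8)

def r2mc(relative_depth, scale, shift):
    # Camera-specific converter: in full VDE this is a small transformer head;
    # here only its affine core, d_metric = scale * d_rel + shift.
    return scale * relative_depth + shift

rgb = np.random.rand(4, 4, 3)
rel = crde(rgb)                               # shared across all cameras
metric_a = r2mc(rel, scale=5.0, shift=0.5)    # hypothetical indoor camera
metric_b = r2mc(rel, scale=40.0, shift=2.0)   # hypothetical outdoor camera
```

The key design point survives even in this sketch: one heavy shared module runs once per image, and serving a new camera only requires a new small converter.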
2. Learning Camera-Invariant Relative Depth
The CRDE module employs a Swin-Base transformer encoder, pretrained on ImageNet, and three stacked Frequency Mixing Modules (FMM) in the decoder, with skip connections from encoder stages. Each FMM implements attention-based mixing of low-frequency decoder and high-frequency encoder features and upsamples via pixel shuffle.
CRDE outputs a normalized depth $\tilde{d}_i$ for each pixel $i$. The ground-truth depth is normalized per image to remove camera dependency, e.g. $d_i^{*} = (d_i - \operatorname{med}(d)) / s(d)$, where $\operatorname{med}(d)$ is the per-image median and $s(d)$ a per-image scale such as the mean absolute deviation. The training objective is a scale-invariant log loss across all cameras,

$$\mathcal{L}_{\text{SI}} = \frac{1}{N}\sum_{i} g_i^2 - \frac{\lambda}{N^2}\Big(\sum_i g_i\Big)^2,$$

with $g_i = \log \hat{d}_i - \log d_i$ and $\lambda$ a variance-weighting hyperparameter (Jun et al., 2023).
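A minimal sketch of the normalization and the scale-invariant log loss, assuming a median/mean-absolute-deviation normalization (the paper's exact scheme may differ) and λ = 0.85 as an illustrative default:

```python
import numpy as np

def normalize_depth(d):
    # Per-image affine normalization of ground-truth depth (median /
    # mean-absolute-deviation form, as used by affine-invariant estimators;
    # assumption: the paper's exact normalization may differ).
    t = np.median(d)
    s = np.mean(np.abs(d - t)) + 1e-8
    return (d - t) / s

def silog_loss(pred, gt, lam=0.85):
    # Scale-invariant log loss: variance-style penalty on log-depth errors.
    # lam is a weighting hyperparameter (value here is an assumption).
    g = np.log(pred) - np.log(gt)
    return np.mean(g ** 2) - lam * np.mean(g) ** 2

gt = np.array([1.0, 2.0, 4.0, 8.0])
zero = silog_loss(gt, gt)                    # perfect prediction -> 0
scaled = silog_loss(2.0 * gt, gt, lam=1.0)   # pure scaling -> 0 when lam = 1
```

With λ = 1 the loss is fully invariant to a global depth scaling, which is exactly why it suits training a camera-invariant relative-depth backbone on mixed data.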
Because it is dedicated to camera-invariant learning, the CRDE generalizes exceptionally well, outperforming established methods such as MiDaS and DPT on relative-depth metrics in unseen domains.
3. Camera-Specific Relative-to-Metric Conversion
To recover metric depth, each R2MC learns to calibrate the normalized depth features to a specific camera's scale and offset. In its most basic form, an R2MC is an affine transformation, $\hat{d} = a\,\tilde{d} + b$, with learned scale $a$ and shift $b$. However, in full VDE, each R2MC is a transformer-style module with two conversion layers that refine the relative-to-metric mapping via attention mechanisms and specialized learned weights.
The R2MC is trained on the corresponding camera's dataset, optimizing the same scale-invariant log loss as the CRDE, now computed against that camera's metric ground truth. This architecture keeps parameter overhead minimal (approximately 1.13% per additional camera) while yielding accurate, camera-aware metric depths without loss of generality (Jun et al., 2023).
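The affine special case named above can be calibrated in closed form from a handful of paired (relative, metric) samples. This is an illustrative sketch only; the full R2MC is a learned transformer head, and the helper name is hypothetical:

```python
import numpy as np

def fit_affine_converter(relative, metric):
    # Closed-form least-squares fit of the affine R2MC core,
    # d_metric ~= a * d_rel + b, from paired samples for one camera.
    # (Assumption: stand-in for the learned converter, not the paper's module.)
    A = np.stack([relative, np.ones_like(relative)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return a, b

rel = np.array([0.1, 0.4, 0.7, 0.9])       # camera-invariant relative depths
met = 3.0 * rel + 0.5                      # synthetic camera: scale 3, shift 0.5
a, b = fit_affine_converter(rel, met)
```

This also makes the efficiency argument concrete: adapting to a new camera means estimating a tiny set of converter parameters, not retraining the backbone.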
4. Canonicalization and Generalization Across Cameras
The central geometric issue in metric monocular depth is focal-length (and more generally, intrinsic) ambiguity. Modern VDE solutions, as exemplified in Metric3Dv2 (Hu et al., 2024), address this via Canonical Camera Space Transformation Modules (CSTM). The core operation is to normalize all imagery (inputs and/or ground-truth outputs) to a canonical focal length $f_c$. This can be done:
- On the label side: scale ground-truth depths by $f_c/f$, where $f$ is the source camera's focal length, and invert the scaling at test time.
- On the image side: rescale input images by $f_c/f$, while keeping depth in canonical space.
This transformation collapses all focal-length-induced variation, enabling a single model to learn true metric prediction across thousands of camera models, spanning wide to telephoto and fisheye. Without canonicalization, mixed-data training fails to converge to metric scales; with it, zero-shot and multi-camera metric generalization is achieved (Hu et al., 2024).
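The label-side variant reduces to a pair of one-line scalings. A minimal sketch, assuming a canonical focal length of 1000 px (the constant is an assumption, not a value from the paper):

```python
def to_canonical(depth, f, f_c=1000.0):
    # Label-side canonicalization (CSTM-style): scale metric depth by f_c / f
    # so all training labels live in a single canonical camera space.
    # Assumption: f_c = 1000 px is an illustrative canonical focal length.
    return depth * (f_c / f)

def from_canonical(depth_c, f, f_c=1000.0):
    # Inverse transform applied at test time to recover true metric depth
    # for the actual camera with focal length f.
    return depth_c * (f / f_c)
```

For example, a point 5 m away seen by an f = 500 px camera is stored as 10 m in canonical space; applying the inverse at test time restores the true 5 m, so a single model can train on arbitrarily mixed intrinsics.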
5. Integrated Depth and Surface Normal Estimation
Metric3Dv2 extends VDE to predict both metric depth and surface normals from single images, jointly optimizing and refining these outputs via an iterative recurrent (ConvGRU) module. The network alternates between updating low-resolution metric depth estimates and their associated (unnormalized) normal vectors, enforcing geometric consistency through coupled loss terms,

$$\mathcal{L} = \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{normal}} + \mathcal{L}_{\text{dn}},$$

where $\mathcal{L}_{\text{depth}}$ combines local patch normalization (RPNL), a scale-invariant global loss, and additional metric constraints; $\mathcal{L}_{\text{normal}}$ is an angular or uncertainty-weighted cosine error; and $\mathcal{L}_{\text{dn}}$ enforces consistency between predicted normals and normals derived from depth gradients.
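The depth-normal coupling term can be sketched as follows. This simplified version derives normals from depth gradients under an orthographic assumption (the actual method back-projects through camera intrinsics), and the function names are illustrative:

```python
import numpy as np

def normals_from_depth(depth):
    # Unit normals from depth gradients (orthographic simplification;
    # assumption: the paper back-projects with camera intrinsics instead).
    dzdy, dzdx = np.gradient(depth)
    n = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def depth_normal_consistency(pred_normals, depth):
    # Mean cosine error between predicted normals and depth-derived normals:
    # the geometric-consistency coupling sketched in the text.
    n_d = normals_from_depth(depth)
    cos = np.sum(pred_normals * n_d, axis=-1)
    return float(np.mean(1.0 - cos))

yy, xx = np.mgrid[0:8, 0:8].astype(float)
plane = 0.2 * xx + 3.0   # tilted planar depth map: constant normal everywhere
loss = depth_normal_consistency(normals_from_depth(plane), plane)
```

On a perfect plane the predicted and depth-derived normals agree, so the coupling term vanishes; any disagreement between the two heads is penalized, which is what drives the joint refinement.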
This multi-task coupling, along with large-scale, multi-source training on over 16 million images from 10,000+ camera models, enables robust zero-shot generalization to previously unseen camera settings, illumination conditions, and scene types.
6. Experimental Benchmarks and Comparative Performance
VDE has been rigorously benchmarked in both versatile (multi-camera) and single-camera settings. Key results for the multi-camera VDE (Jun et al., 2023) on 10 camera-specific sub-datasets drawn from NYUv2, DIML, DIODE, ScanNet, and SUN-RGBD include:
| Method | RMSE | REL | δ₁ | Kendall's τ |
|---|---|---|---|---|
| Separate networks | 0.612 | 0.179 | 0.760 | 0.712 |
| Multiple decoders | 0.632 | 0.196 | 0.734 | 0.738 |
| VDE (1 CRDE+10 R2MCs) | 0.559 | 0.164 | 0.795 | 0.768 |
The VDE achieves these results with only 167.3 M parameters compared to 1.498 G for ten independent networks.
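A back-of-envelope check ties the quoted totals to the per-camera overhead figure. The two totals are from the text; the per-camera overhead is derived here, not quoted:

```python
# Parameter counts quoted in the text (in millions of parameters).
separate_total_m = 1498.0   # ten fully independent networks
vde_total_m = 167.3         # 1 shared CRDE + 10 lightweight R2MCs

# Derived quantities (assumption: equal-sized standalone networks).
per_network_m = separate_total_m / 10                   # ~149.8 M each
overhead_per_camera = (vde_total_m - per_network_m) / per_network_m / 10
```

The derived value is about 1.2% per camera, roughly consistent with the ~1.13% per-camera overhead reported for the R2MC modules.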
On single-camera (NYUv2) evaluation, VDE approaches or surpasses the state of the art, e.g., RMSE of 0.315 (second best), REL of 0.088, and the best δ₁ (δ < 1.25) of 0.934.
Metric3Dv2 (Hu et al., 2024) demonstrates further advances in generalization:
- NYUv2 (indoor): δ₁=0.975, AbsRel=0.063, RMSE≈0.251 (zero-shot, no fine-tuning)
- KITTI (outdoor): δ₁=0.974, AbsRel=0.052, RMSE≈2.51
- Surface normals: median error 7.1°, mean 13.1° (NYUv2)
- Significant improvements in 3D reconstruction (Chamfer-ℓ₁ and F-score) and SLAM odometry drift when substituting VDE-derived metric depth
VDE outperforms affine-invariant approaches (MiDaS, DPT, LeReS, HDN) on generalization and camera calibration tasks. Ablation studies indicate the necessity of the FMM, R2MC transformer-based mapping, and, in the unified model, the CSTM and joint depth-normal losses.
7. Applications and Broader Implications
The VDE architecture enables flexible deployment for multi-camera systems in robotics, visual SLAM, and 3D scene understanding by minimizing retraining cost for new hardware configurations. The generalization strategies in foundation-model-scale VDE further enable plug-and-play metric 3D reconstruction, single-image metrology, and robust inference on Internet-scale photo domains. For SLAM and visual odometry, metric VDE depths integrated with systems like Droid-SLAM reduce translational drift by more than an order of magnitude, enabling robust metric scale recovery (Hu et al., 2024).
A plausible implication is that the combination of affine-invariant core depth reasoning, camera-parameter canonicalization, and light-weight per-camera adaptation forms a general recipe for extending deep metric estimation to new modalities and sensors, aligning with the recent trend toward geometric foundation models. The VDE methodology robustly addresses camera and domain shift without sacrificing accuracy or efficiency.