Versatile Depth Estimator (VDE)
- VDE is an innovative framework that separates camera-invariant depth estimation from camera-specific conversion to yield robust metric predictions.
- It decouples a common relative depth estimator (CRDE) from lightweight per-camera converters (R2MCs), enabling efficient generalization across diverse camera models.
- By integrating canonicalization and joint depth-normal optimization, VDE achieves state-of-the-art performance in 3D reconstruction, visual SLAM, and scene understanding.
The Versatile Depth Estimator (VDE) is an architectural and algorithmic framework for monocular metric depth (and optionally surface normal) estimation that achieves robust, state-of-the-art performance across diverse camera models and scenes by decoupling camera-invariant relative depth extraction from camera-specific metric conversion. Unlike traditional approaches that require network retraining for each camera or exhibit significant failure on novel intrinsics, VDE is engineered to adapt flexibly to multiple cameras, achieving efficient multi-camera deployment and rapid metric generalization. Recent extensions embed VDE principles into foundation models capable of zero-shot metric depth and normal prediction using massive, heterogeneous photo datasets and explicit camera-parameter normalization (Jun et al., 2023, Hu et al., 2024).
1. Architectural Principles
VDE is based on two fundamental modules: the Common Relative Depth Estimator (CRDE) and multiple Relative-to-Metric Converters (R2MCs). The CRDE is a large camera-invariant backbone that produces a normalized or affine-invariant depth representation from RGB imagery. Each R2MC is a lightweight, camera-specific head that transforms the CRDE output into absolute, metrically-scaled depth predictions tailored to individual camera intrinsics. This separation enables the model to serve an arbitrary number of camera types by parameterizing only these small converter modules, drastically reducing overhead compared to fully independent models per camera.
The canonical data flow in the original VDE can be written as

$$I \xrightarrow{\text{encoder}} \{E_s\} \xrightarrow{\text{pooling}} z \xrightarrow{\text{FMM decoder}} F_r \longrightarrow \tilde{d} \xrightarrow{\text{R2MC}_k} \hat{d}_k,$$

where $I$ is the input RGB image, $\{E_s\}$ are multi-scale encoder features, $z$ is the bottleneck pooled feature, and $F_r$ is the frequency-mixed relative depth feature. $\tilde{d}$ is a normalized, camera-invariant depth, while $\text{R2MC}_k$ transforms it into a metric estimate $\hat{d}_k$ for camera $k$ (Jun et al., 2023).
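The two-stage flow can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation: `crde` uses image luminance as a placeholder for the learned relative depth, and `r2mc` shows only the affine core of the camera-specific converter; all function names and constants here are illustrative assumptions.

```python
import numpy as np

def crde(image):
    # Camera-invariant stage (stand-in for the Swin encoder + FMM decoder):
    # maps an RGB image to per-pixel relative depth normalized to [0, 1].
    relative = image.mean(axis=-1)  # toy proxy for the learned relative depth
    lo, hi = relative.min(), relative.max()
    return (relative - lo) / (hi - lo + 1e-8)

def r2mc(relative_depth, scale, shift):
    # Camera-specific converter: in full VDE this is a small transformer head;
    # here only its affine core, d_metric = scale * d_rel + shift.
    return scale * relative_depth + shift

rgb = np.random.rand(4, 4, 3)
rel = crde(rgb)                               # shared across all cameras
metric_a = r2mc(rel, scale=5.0, shift=0.5)    # hypothetical indoor camera
metric_b = r2mc(rel, scale=40.0, shift=2.0)   # hypothetical outdoor camera
```

The key design point survives even in this sketch: one heavy shared module runs once per image, and serving a new camera only requires a new small converter.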
2. Learning Camera-Invariant Relative Depth
The CRDE module employs a Swin-Base transformer encoder, pretrained on ImageNet, and three stacked Frequency Mixing Modules (FMM) in the decoder, with skip connections from encoder stages. Each FMM implements attention-based mixing of low-frequency decoder and high-frequency encoder features and upsamples via pixel shuffle.
CRDE outputs a normalized depth $\tilde{d}_i$ for each pixel $i$. The ground-truth depth is normalized per image to remove camera dependency, e.g. $d_i^{*} = (d_i - \operatorname{med}(d)) / s(d)$, where $\operatorname{med}(d)$ is the per-image median and $s(d)$ a per-image scale such as the mean absolute deviation. The training objective is a scale-invariant log loss across all cameras,

$$\mathcal{L}_{\text{SI}} = \frac{1}{N}\sum_{i} g_i^2 - \frac{\lambda}{N^2}\Big(\sum_i g_i\Big)^2,$$

with $g_i = \log \hat{d}_i - \log d_i$ and $\lambda$ a variance-weighting hyperparameter (Jun et al., 2023).
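A minimal sketch of the normalization and the scale-invariant log loss, assuming a median/mean-absolute-deviation normalization (the paper's exact scheme may differ) and λ = 0.85 as an illustrative default:

```python
import numpy as np

def normalize_depth(d):
    # Per-image affine normalization of ground-truth depth (median /
    # mean-absolute-deviation form, as used by affine-invariant estimators;
    # assumption: the paper's exact normalization may differ).
    t = np.median(d)
    s = np.mean(np.abs(d - t)) + 1e-8
    return (d - t) / s

def silog_loss(pred, gt, lam=0.85):
    # Scale-invariant log loss: variance-style penalty on log-depth errors.
    # lam is a weighting hyperparameter (value here is an assumption).
    g = np.log(pred) - np.log(gt)
    return np.mean(g ** 2) - lam * np.mean(g) ** 2

gt = np.array([1.0, 2.0, 4.0, 8.0])
zero = silog_loss(gt, gt)                    # perfect prediction -> 0
scaled = silog_loss(2.0 * gt, gt, lam=1.0)   # pure scaling -> 0 when lam = 1
```

With λ = 1 the loss is fully invariant to a global depth scaling, which is exactly why it suits training a camera-invariant relative-depth backbone on mixed data.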
Because it is dedicated to camera-invariant learning, the CRDE generalizes exceptionally well, outperforming established methods such as MiDaS and DPT on relative-depth metrics in unseen domains.
3. Camera-Specific Relative-to-Metric Conversion
To recover metric depth, each R2MC learns to calibrate the normalized depth features to a specific camera's scale and offset. In its most basic form, an R2MC is an affine transformation, $\hat{d} = a\,\tilde{d} + b$, with learned scale $a$ and shift $b$. However, in full VDE, each R2MC is a transformer-style module with two conversion layers that refine the relative-to-metric mapping via attention mechanisms and specialized learned weights.
The R2MC is trained on the corresponding camera's dataset, optimizing the same scale-invariant log loss as the CRDE, now computed against that camera's metric ground truth. This architecture keeps parameter overhead minimal (approximately 1.13% per additional camera) while yielding accurate, camera-aware metric depths without loss of generality (Jun et al., 2023).
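The affine special case named above can be calibrated in closed form from a handful of paired (relative, metric) samples. This is an illustrative sketch only; the full R2MC is a learned transformer head, and the helper name is hypothetical:

```python
import numpy as np

def fit_affine_converter(relative, metric):
    # Closed-form least-squares fit of the affine R2MC core,
    # d_metric ~= a * d_rel + b, from paired samples for one camera.
    # (Assumption: stand-in for the learned converter, not the paper's module.)
    A = np.stack([relative, np.ones_like(relative)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return a, b

rel = np.array([0.1, 0.4, 0.7, 0.9])       # camera-invariant relative depths
met = 3.0 * rel + 0.5                      # synthetic camera: scale 3, shift 0.5
a, b = fit_affine_converter(rel, met)
```

This also makes the efficiency argument concrete: adapting to a new camera means estimating a tiny set of converter parameters, not retraining the backbone.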
4. Canonicalization and Generalization Across Cameras
The central geometric issue in metric monocular depth is focal-length (and more generally, intrinsic) ambiguity. Modern VDE solutions, as exemplified in Metric3Dv2 (Hu et al., 2024), address this via Canonical Camera Space Transformation Modules (CSTM). The core operation is to normalize all imagery (inputs and/or ground-truth outputs) to a canonical focal length $f_c$. This can be done:
- On the label side: scale ground-truth depths by $f_c/f$, where $f$ is the source camera's focal length, and invert the scaling at test time.
- On the image side: rescale input images by $f_c/f$, while keeping depth in canonical space.
This transformation collapses all focal-length-induced variation, enabling a single model to learn true metric prediction across thousands of camera models, spanning wide to telephoto and fisheye. Without canonicalization, mixed-data training fails to converge to metric scales; with it, zero-shot and multi-camera metric generalization is achieved (Hu et al., 2024).
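The label-side variant reduces to a pair of one-line scalings. A minimal sketch, assuming a canonical focal length of 1000 px (the constant is an assumption, not a value from the paper):

```python
def to_canonical(depth, f, f_c=1000.0):
    # Label-side canonicalization (CSTM-style): scale metric depth by f_c / f
    # so all training labels live in a single canonical camera space.
    # Assumption: f_c = 1000 px is an illustrative canonical focal length.
    return depth * (f_c / f)

def from_canonical(depth_c, f, f_c=1000.0):
    # Inverse transform applied at test time to recover true metric depth
    # for the actual camera with focal length f.
    return depth_c * (f / f_c)
```

For example, a point 5 m away seen by an f = 500 px camera is stored as 10 m in canonical space; applying the inverse at test time restores the true 5 m, so a single model can train on arbitrarily mixed intrinsics.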
5. Integrated Depth and Surface Normal Estimation
Metric3Dv2 extends VDE to predict both metric depth and surface normals from single images, jointly optimizing and refining these outputs via an iterative recurrent (ConvGRU) module. The network alternates between updating low-resolution metric depth estimates and their associated (unnormalized) normal vectors, enforcing geometric consistency through coupled loss terms,

$$\mathcal{L} = \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{normal}} + \mathcal{L}_{\text{dn}},$$

where $\mathcal{L}_{\text{depth}}$ combines local patch normalization (RPNL), a scale-invariant global loss, and additional metric constraints; $\mathcal{L}_{\text{normal}}$ is an angular or uncertainty-weighted cosine error; and $\mathcal{L}_{\text{dn}}$ enforces consistency between predicted normals and normals derived from depth gradients.
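The depth-normal coupling term can be sketched as follows. This simplified version derives normals from depth gradients under an orthographic assumption (the actual method back-projects through camera intrinsics), and the function names are illustrative:

```python
import numpy as np

def normals_from_depth(depth):
    # Unit normals from depth gradients (orthographic simplification;
    # assumption: the paper back-projects with camera intrinsics instead).
    dzdy, dzdx = np.gradient(depth)
    n = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def depth_normal_consistency(pred_normals, depth):
    # Mean cosine error between predicted normals and depth-derived normals:
    # the geometric-consistency coupling sketched in the text.
    n_d = normals_from_depth(depth)
    cos = np.sum(pred_normals * n_d, axis=-1)
    return float(np.mean(1.0 - cos))

yy, xx = np.mgrid[0:8, 0:8].astype(float)
plane = 0.2 * xx + 3.0   # tilted planar depth map: constant normal everywhere
loss = depth_normal_consistency(normals_from_depth(plane), plane)
```

On a perfect plane the predicted and depth-derived normals agree, so the coupling term vanishes; any disagreement between the two heads is penalized, which is what drives the joint refinement.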
This multi-task coupling, along with large-scale, multi-source training on over 16 million images from 10,000+ camera models, enables robust zero-shot generalization to previously unseen camera settings, illumination conditions, and scene types.
6. Experimental Benchmarks and Comparative Performance
VDE has been rigorously benchmarked in both versatile (multi-camera) and single-camera settings. Key results for the multi-camera VDE (Jun et al., 2023) on 10 camera-specific sub-datasets drawn from NYUv2, DIML, DIODE, ScanNet, and SUN-RGBD include:
| Method | RMSE | REL | δ₁ | Kendall's τ |
|---|---|---|---|---|
| Separate networks | 0.612 | 0.179 | 0.760 | 0.712 |
| Multiple decoders | 0.632 | 0.196 | 0.734 | 0.738 |
| VDE (1 CRDE+10 R2MCs) | 0.559 | 0.164 | 0.795 | 0.768 |
The VDE achieves these results with only 167.3 M parameters compared to 1.498 G for ten independent networks.
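A back-of-envelope check ties the quoted totals to the per-camera overhead figure. The two totals are from the text; the per-camera overhead is derived here, not quoted:

```python
# Parameter counts quoted in the text (in millions of parameters).
separate_total_m = 1498.0   # ten fully independent networks
vde_total_m = 167.3         # 1 shared CRDE + 10 lightweight R2MCs

# Derived quantities (assumption: equal-sized standalone networks).
per_network_m = separate_total_m / 10                   # ~149.8 M each
overhead_per_camera = (vde_total_m - per_network_m) / per_network_m / 10
```

The derived value is about 1.2% per camera, roughly consistent with the ~1.13% per-camera overhead reported for the R2MC modules.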
On single-camera (NYUv2) evaluation, VDE approaches or surpasses the state of the art, e.g., RMSE of 0.315 (second best), REL of 0.088, and the best δ₁ (δ < 1.25) of 0.934.
Metric3Dv2 (Hu et al., 2024) demonstrates further advances in generalization:
- NYUv2 (indoor): δ₁=0.975, AbsRel=0.063, RMSE≈0.251 (zero-shot, no fine-tuning)
- KITTI (outdoor): δ₁=0.974, AbsRel=0.052, RMSE≈2.51
- Surface normals: median error 7.1°, mean 13.1° (NYUv2)
- Significant improvements in 3D reconstruction (Chamfer-ℓ₁ and F-score) and SLAM odometry drift when substituting VDE-derived metric depth
VDE outperforms affine-invariant approaches (MiDaS, DPT, LeReS, HDN) on generalization and camera calibration tasks. Ablation studies indicate the necessity of the FMM, R2MC transformer-based mapping, and, in the unified model, the CSTM and joint depth-normal losses.
7. Applications and Broader Implications
The VDE architecture enables flexible deployment for multi-camera systems in robotics, visual SLAM, and 3D scene understanding by minimizing retraining cost for new hardware configurations. The generalization strategies in foundation-model-scale VDE further enable plug-and-play metric 3D reconstruction, single-image metrology, and robust inference on Internet-scale photo domains. For SLAM and visual odometry, metric VDE depths integrated with systems like Droid-SLAM reduce translational drift by more than an order of magnitude, enabling robust metric scale recovery (Hu et al., 2024).
A plausible implication is that the combination of affine-invariant core depth reasoning, camera-parameter canonicalization, and light-weight per-camera adaptation forms a general recipe for extending deep metric estimation to new modalities and sensors, aligning with the recent trend toward geometric foundation models. The VDE methodology robustly addresses camera and domain shift without sacrificing accuracy or efficiency.