Depth Anything V2

Published 13 Jun 2024 in cs.CV | (2406.09414v1)

Abstract: This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.

Abstract PDF HTML Upgrade to Chat

Citations (104)

View on Semantic Scholar

Summary

The paper introduces Depth Anything V2, which enhances monocular depth estimation by integrating synthetic data with pseudo-labeled real images.
It implements a teacher-student architecture using advanced vision encoders like DINOv2-G to bridge the synthetic-to-real domain gap.
The new DA-2K benchmark validates the model's superior accuracy and robustness across varied and challenging real-world settings.

Depth Anything V2: Advancements in Monocular Depth Estimation

Introduction

The paper "Depth Anything V2" (arXiv ID: (2406.09414)) introduces a new method for monocular depth estimation, namely Depth Anything V2, which aims to enhance the precision and robustness of depth predictions compared to its predecessor, Depth Anything V1. This improvement is achieved by leveraging synthetic images, scaling up model capacity, and utilizing pseudo-labeled real images to train student models. The proposed approach promises faster inference, fewer parameters, and higher accuracy compared to existing models based on Stable Diffusion.

Figure 1: Depth Anything V2 significantly outperforms V1~\cite{depth_anything}.

Revisiting Real and Synthetic Data

The research details the drawbacks of real labeled images, particularly the presence of label noise and the inability to capture fine details in depth maps. Figures emphasize these shortcomings, such as missing details in tree and chair structures (Figure 2). In contrast, synthetic images provide highly precise depth annotations, capable of capturing intricate details and challenging scenarios like transparent and reflective surfaces.

Figure 2: Coarse depth of real data (HRWSI~\cite{hrwsi).

The adoption of synthetic data addresses the limitations of real datasets, but poses its own challenges, such as distribution shifts between synthetic and real images and limited scene diversity. The paper discusses how these challenges can be mitigated by scaling up the teacher model and leveraging large-scale pseudo-labeled real images to bridge the domain gap and enhance scene coverage.

Model Architecture and Training

Depth Anything V2 emphasizes training a capable teacher model using synthetic images, followed by generating pseudo-depth labels for a vast collection of unlabeled real images. Student models are trained using these pseudo-labeled images, facilitating robust generalization to real-world scenarios.

Figure 3: Depth Anything V2 pipeline.

The strategy of combining synthetic and pseudo-labeled real data enhances robustness and maintains fine-grained predictions even in complex scenes. Evaluation of various vision encoders highlights DINOv2-G as the most effective in the synthetic-to-real transfer, achieving superior predictions compared to other encoder models (Figure 4).

Figure 4: Qualitative comparison of different vision encoders on synthetic-to-real transfer.

Evaluation and Benchmarking

To address the noise and lack of diversity in existing test benchmarks, the authors propose DA-2K, a new evaluation benchmark featuring high-resolution images with precise sparse annotations. This benchmark offers a diversified set of scenarios, including indoor, outdoor, and adverse conditions.

Figure 5: Visualization of images and precise sparse annotations on our benchmark DA-2K.

Zero-shot depth estimation metrics across conventional benchmarks show Depth Anything V2 holds up well against MiDaS and Depth Anything V1. However, DA-2K results reveal superior accuracy of V2 models in diverse test scenarios, underscoring the benefits of using pseudo-labeled data.

Figure 6: Visualization of widely adopted but indeed noisy test benchmark~\cite{nyud).

Conclusion

Depth Anything V2 provides a compelling solution for monocular depth estimation, leveraging synthetic images for training robust models and bridging the domain gap with pseudo-labeled real images. This approach not only enhances depth accuracy and efficiency but also broadens the applicability of MDE models across varied scenarios. Future work should focus on further expanding the diversity of synthetic data and refining pseudo-labeling techniques to push the boundaries of depth estimation technology.