- The paper introduces CDSFNet, a novel network that uses curvature-guided dynamic scale convolution to enhance feature extraction in multi-view stereo.
- It integrates a cascaded MVS framework (CDS-MVSNet) that refines depth with dynamic patch scaling to reduce matching ambiguity and improve reconstruction.
- Experimental results on DTU and Tanks and Temples benchmarks demonstrate superior completeness and efficiency compared to existing methods.
Curvature-Guided Dynamic Scale Networks for Multi-View Stereo
The paper addresses a significant challenge in multi-view stereo (MVS): accurately estimating dense correspondences across high-resolution images for 3D reconstruction. The authors propose a novel approach to feature extraction in MVS that seeks to overcome common difficulties such as matching ambiguity and computational complexity. They introduce a curvature-guided dynamic scale feature network (CDSFNet) that dynamically selects a patch scale for each pixel, enhancing the discriminative capability of feature extraction without imposing a heavy computational burden.
Central to the paper is the introduction of CDSFNet, a feature extraction network that integrates dynamic scale processing into MVS architectures. At its core is the curvature-guided dynamic scale convolution (CDSConv), which estimates the optimal patch scale for every pixel by leveraging the normal curvature of the image surface. This design lets the network adapt to variations in object scale, texture, and epipolar geometry, improving matching-cost computation between reference and source images.
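The idea of curvature-guided scale selection can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions, not the authors' implementation: a Laplacian magnitude stands in for the learned normal-curvature estimate, and two box-filter sizes stand in for the candidate patch scales; the function names and the `thresh` parameter are hypothetical.

```python
import numpy as np

def box_blur(img, k):
    """Average each pixel over a k x k neighborhood (edge-replicated padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def curvature_proxy(img):
    """Laplacian magnitude as a crude stand-in for normal curvature
    (periodic boundary via np.roll, for simplicity)."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return np.abs(lap)

def dynamic_scale_features(img, scales=(3, 7), thresh=0.1):
    """High-curvature pixels keep the fine scale; flat regions get the coarse one."""
    img = img.astype(float)
    kappa = curvature_proxy(img)
    fine = box_blur(img, scales[0])
    coarse = box_blur(img, scales[1])
    return np.where(kappa > thresh, fine, coarse)
```

The point of the sketch is the per-pixel branch at the end: rather than filtering the whole image at one fixed receptive field, each pixel is assigned the scale its local surface geometry calls for.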
CDSConv Mechanism
The CDSConv layer selects an appropriate patch scale for each pixel based on the normal curvature of the image surface, evaluated at multiple candidate scales. The selected scale yields a robust feature representation, improving matching precision across varying image resolutions and object scales. The paper highlights the computational efficiency gained by approximating the second-order derivatives of the image surface with learnable kernels, which makes the dynamic filtering operation cheap enough for practical MVS performance.
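To make the curvature computation concrete, the following sketch treats the image as a surface z = I(x, y) and evaluates its mean curvature from first and second derivatives. Fixed central-difference kernels stand in for the paper's learnable second-derivative kernels, and mean curvature is used as one concrete curvature measure; none of this is the authors' exact formulation.

```python
import numpy as np

def derivatives(f):
    """Central finite differences (periodic boundary via np.roll);
    these fixed stencils stand in for learnable derivative kernels."""
    fx = (np.roll(f, -1, 1) - np.roll(f, 1, 1)) / 2.0
    fy = (np.roll(f, -1, 0) - np.roll(f, 1, 0)) / 2.0
    fxx = np.roll(f, -1, 1) - 2 * f + np.roll(f, 1, 1)
    fyy = np.roll(f, -1, 0) - 2 * f + np.roll(f, 1, 0)
    fxy = (np.roll(np.roll(f, -1, 0), -1, 1) - np.roll(np.roll(f, -1, 0), 1, 1)
           - np.roll(np.roll(f, 1, 0), -1, 1) + np.roll(np.roll(f, 1, 0), 1, 1)) / 4.0
    return fx, fy, fxx, fyy, fxy

def mean_curvature(f):
    """Mean curvature of the Monge patch z = f(x, y)."""
    fx, fy, fxx, fyy, fxy = derivatives(f.astype(float))
    num = fxx * (1 + fy ** 2) - 2 * fx * fy * fxy + fyy * (1 + fx ** 2)
    den = 2 * (1 + fx ** 2 + fy ** 2) ** 1.5
    return num / den
```

A flat image has zero curvature everywhere, while a sharp intensity peak produces a strongly negative value at its apex; it is exactly this kind of per-pixel signal that would drive the scale selection above.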
Proposed MVS Framework: CDS-MVSNet
The paper outlines the formulation of a new MVS framework, CDS-MVSNet, which integrates the robust features extracted by CDSFNet into a cascade network architecture for depth estimation. The architecture refines depth maps in a coarse-to-fine manner, utilizing cost volumes based on features from CDSFNet to diminish matching ambiguity. This approach not only improves reconstruction quality but also optimizes computational resources, enabling high-resolution input processing with decreased runtime and memory consumption.
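The coarse-to-fine refinement can be illustrated with a toy cascade: at each level, sample depth hypotheses in the current range, evaluate a matching cost, keep the cheapest hypothesis, and shrink the search range around it. This is a generic cascade-MVS pattern, not the CDS-MVSNet implementation; the function and its parameters (`levels`, `hyps`, `shrink`) are hypothetical, and a simple quadratic stands in for a real cost volume.

```python
import numpy as np

def coarse_to_fine_depth(cost_fn, d_min, d_max, levels=3, hyps=8, shrink=0.25):
    """Cascade depth search: sample hypotheses, pick the cheapest,
    then narrow the depth range around it at the next level."""
    lo, hi = d_min, d_max
    for _ in range(levels):
        candidates = np.linspace(lo, hi, hyps)
        costs = np.array([cost_fn(d) for d in candidates])
        best = candidates[np.argmin(costs)]
        half = (hi - lo) * shrink / 2.0       # shrink the range around the winner
        lo, hi = best - half, best + half
    return best

# Toy per-pixel cost with its minimum at depth 2.5 (stands in for a cost volume).
depth = coarse_to_fine_depth(lambda d: (d - 2.5) ** 2, 0.0, 10.0)
```

Each level spends the same small number of hypotheses on an ever-narrower range, which is why the cascade reaches fine depth resolution without the memory cost of densely sampling the full range at once.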
Visibility-Aware Cost Aggregation
A novel aspect of CDS-MVSNet is its application of visibility-aware cost aggregation, informed by pixel-wise visibility predictions derived from normal curvature estimation. This strategy mitigates the impact of occlusions and untextured regions, enhancing cost volume accuracy and subsequently improving depth estimation outcomes.
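A minimal sketch of visibility-weighted aggregation, under the assumption that per-view matching costs and per-pixel visibility weights are already available as arrays; the function name and shapes are illustrative, not the paper's API.

```python
import numpy as np

def aggregate_costs(per_view_costs, visibility):
    """Fuse per-source-view matching costs into one cost volume,
    down-weighting views where a pixel is predicted occluded.

    per_view_costs: (V, D, H, W) cost per view, depth hypothesis, pixel
    visibility:     (V, H, W) weights in [0, 1]; near 0 for occluded pixels
    """
    w = visibility[:, None, :, :]             # broadcast over the depth axis
    weighted = (per_view_costs * w).sum(axis=0)
    norm = w.sum(axis=0).clip(min=1e-6)       # guard against all-invisible pixels
    return weighted / norm                    # aggregated volume, shape (D, H, W)
```

The effect is that an occluded or untextured view contributes little to the fused volume, so its unreliable costs no longer corrupt the depth estimate at that pixel.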
Experimental Results and Implications
The paper's extensive experimental evaluation demonstrates the effectiveness of the proposed method on benchmark datasets such as DTU and Tanks and Temples. Notably, CDS-MVSNet outperforms existing MVS methods, achieving superior reconstruction completeness with reduced computational overhead.
Practical Implications
Practically, the method's ability to generate high-quality 3D models from half-resolution images presents significant advantages for real-time applications and resource-constrained environments. The dynamic scale feature extraction provides a versatile solution adaptable to various scene complexities and image resolutions, setting a precedent for future developments in MVS technologies.
Future Directions
The research opens avenues for further exploration into the integration of curvature-guided feature extraction with other computer vision tasks beyond MVS. There is potential for extending the application of these dynamic scaling mechanisms into domains such as semantic segmentation or object detection, where scale variation remains a persistent challenge.
In conclusion, this paper presents a forward-looking approach to MVS by innovatively addressing scale variability through curvature-guided feature extraction, significantly advancing the state-of-the-art in 3D reconstruction. The implications for both theoretical exploration and practical deployment are substantial, paving the way for more adaptive and efficient computer vision systems.