Overview of TANDEM: Tracking and Dense Mapping in Real-time Using Deep Multi-view Stereo
The paper titled "TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo" presents a significant advancement in monocular Simultaneous Localization and Mapping (SLAM). TANDEM is a robust real-time framework that combines monocular tracking with dense mapping driven by a novel CNN-based approach, Cascade View-Aggregation MVSNet (CVA-MVSNet), which enables efficient depth prediction through multi-view stereo techniques. This integration blends classical optimization-based visual odometry with neural network predictions, achieving state-of-the-art performance in both camera tracking (pose estimation) and dense 3D scene reconstruction.
System Architecture and Functionality
TANDEM consists of three integral components: monocular visual odometry, dense depth estimation using CVA-MVSNet, and volumetric mapping. It enhances tracking robustness through dense direct image alignment against depth maps rendered from an incrementally built TSDF-based global model. By using these dense rendered depth maps for visual odometry, TANDEM overcomes the limitations of the sparse depth information available in traditional frameworks such as DSO.
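The dense direct image alignment the paper describes can be illustrated with a minimal sketch: every reference pixel is back-projected using the depth rendered from the global model, transformed by a candidate relative pose, and reprojected into the current frame, yielding photometric residuals. The function below is a simplified, hypothetical illustration (nearest-neighbour sampling, no robust weighting or Gauss-Newton loop), not the paper's implementation.

```python
import numpy as np

def photometric_residuals(ref_img, ref_depth, cur_img, K, T_cur_ref):
    """Per-pixel photometric residuals for dense direct alignment.

    ref_depth is the dense depth map rendered from the TSDF model;
    T_cur_ref is the 4x4 relative pose hypothesis. A full system would
    minimize these residuals over the pose with Gauss-Newton and a
    robust cost, but the residual computation is the core idea.
    """
    h, w = ref_img.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = ref_depth
    # Back-project every reference pixel to a 3D point (homogeneous).
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy,
                    z, np.ones_like(z)], axis=-1)
    # Transform into the current camera frame and reproject.
    pts_cur = pts @ T_cur_ref.T
    u2 = fx * pts_cur[..., 0] / pts_cur[..., 2] + cx
    v2 = fy * pts_cur[..., 1] / pts_cur[..., 2] + cy
    # Nearest-neighbour lookup; mask pixels that fall outside the image.
    ui, vi = np.round(u2).astype(int), np.round(v2).astype(int)
    valid = (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    res = np.zeros_like(ref_img, dtype=float)
    res[valid] = cur_img[vi[valid], ui[valid]] - ref_img[valid]
    return res, valid
```

With the identity pose and identical images the residuals vanish, which is the fixed point the pose optimization drives toward.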
CVA-MVSNet is at the core of TANDEM's capabilities, offering hierarchical depth prediction by constructing 3D cost volumes with adaptive view aggregation. This technique not only accounts for the different stereo baselines among keyframes but also incorporates a self-adaptive weighting scheme to handle occlusions and varying image overlap. The resultant depth maps are fused into a consistent TSDF voxel grid, yielding notable improvements in real-time 3D reconstruction quality over contemporary learning-based and traditional methods.
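The adaptive view aggregation idea can be sketched as a learned, per-pixel softmax weighting over the per-view matching costs, so that occluded or poorly overlapping source views contribute less to the fused cost volume. The shapes and the softmax choice below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def aggregate_cost_volumes(view_costs, view_logits):
    """Self-adaptive aggregation of per-view matching costs.

    view_costs:  (V, D, H, W) matching cost for each of V source views
                 and D depth hypotheses.
    view_logits: (V, D, H, W) per-pixel scores (e.g., from a small CNN)
                 that should be low for occluded or weakly overlapping
                 views. A softmax over the view axis turns them into
                 normalized aggregation weights.
    Returns a single (D, H, W) cost volume.
    """
    w = np.exp(view_logits - view_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)  # softmax over views
    return (w * view_costs).sum(axis=0)
```

With uniform logits this reduces to a plain per-view average; the benefit comes when the network learns to suppress views whose photometric evidence is unreliable.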
Experimental Results
Extensive evaluation of TANDEM was conducted on both synthetic and real-world datasets. In camera tracking tests on the EuRoC and ICL-NUIM datasets, TANDEM achieved superior tracking accuracy compared to established techniques such as ORB-SLAM2 and DeepFactors. Notably, TANDEM maintained this accuracy without inertial sensors, which competing methods such as CodeVIO require. In terms of 3D reconstruction, TANDEM demonstrated enhanced detail capture and completeness, surpassing traditional monocular methods and performing comparably to RGB-D approaches, which acquire dense depth maps directly from their sensors.
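Tracking accuracy in such comparisons is typically reported as the absolute trajectory error (ATE) RMSE after rigidly aligning the estimated trajectory to ground truth. As a point of reference, here is a minimal sketch of that metric using the closed-form Kabsch/Umeyama alignment; it is a generic benchmark convention, not code from the paper.

```python
import numpy as np

def ate_rmse(gt, est):
    """Absolute trajectory error (RMSE) between two Nx3 trajectories.

    Rigidly aligns est to gt (rotation + translation, closed-form
    Kabsch solution), then returns the root-mean-square positional
    error, as is standard in SLAM tracking benchmarks.
    """
    gt_mean, est_mean = gt.mean(axis=0), est.mean(axis=0)
    gt_c, est_c = gt - gt_mean, est - est_mean
    # Kabsch: rotation R minimizing ||gt_c - est_c @ R.T||.
    H = est_c.T @ gt_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    aligned = est_c @ R.T + gt_mean
    err = aligned - gt
    return np.sqrt((err ** 2).sum(axis=1).mean())
```

An estimate that differs from ground truth only by a rigid-body motion scores (numerically) zero under this metric, which is why it isolates drift and local error rather than the arbitrary choice of world frame.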
Implications and Future Directions
This paper advances both the practical applicability and the theoretical underpinnings of monocular SLAM by successfully marrying traditional optimization approaches with deep learning-based stereo reconstruction. The implications of TANDEM are significant for autonomous robot navigation, augmented reality applications, and real-time environmental mapping. Its deep component, CVA-MVSNet, holds promise for further refinement through improved network architectures or self-supervised learning techniques, enabling better generalization to diverse environments.
Potential future developments could focus on reducing the system's computational requirements to make deployment viable on embedded hardware in mobile platforms. Techniques that mitigate known failure cases, such as pure rotational motion or dynamic environments, could further enhance robustness. As neural networks and visual odometry methods evolve, TANDEM provides a solid foundation for integrating rich visual data with navigational intelligence, paving the way for autonomous systems capable of perceiving and interpreting complex environments with precision.