- The paper introduces VGGT-SLAM, a dense RGB SLAM system that optimizes on the SL(4) manifold to resolve the projective ambiguity of uncalibrated monocular reconstruction.
- It employs 15-DOF homography estimation and nonlinear factor graph optimization to align submaps accurately and maintain global map consistency.
- Experimental results demonstrate competitive performance on standard benchmarks, with future work aimed at addressing degeneracies in homogeneous planar scenes.
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
Introduction
The paper introduces VGGT-SLAM, a dense RGB SLAM system built on VGGT, a feed-forward scene reconstruction model. Previous methods align submaps with similarity transformations. VGGT-SLAM instead optimizes over the special linear group SL(4), which addresses the projective ambiguity inherent in monocular reconstruction with uncalibrated cameras. Working submap by submap also lets the system reconstruct large-scale scenes that VGGT alone cannot process because of GPU memory limits.
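For reference, the two transformation families being compared can be written down explicitly. The block below uses standard textbook definitions rather than notation taken from the paper; the symbols H, s, R, and t are chosen here for illustration.

```latex
% A 3D projective transformation acts on homogeneous points, up to scale:
\tilde{x}' \simeq H \tilde{x}, \qquad \tilde{x} = (x, y, z, 1)^\top, \quad H \in \mathbb{R}^{4 \times 4}

% Fixing the free scale with det H = 1 gives the special linear group:
\mathrm{SL}(4) = \{\, H \in \mathbb{R}^{4 \times 4} : \det H = 1 \,\}, \qquad \dim \mathrm{SL}(4) = 16 - 1 = 15

% A similarity transformation is the 7-DOF special case (3 rotation, 3 translation, 1 scale):
H_{\mathrm{sim}} \simeq \begin{bmatrix} s R & t \\ 0^\top & 1 \end{bmatrix}, \qquad R \in \mathrm{SO}(3), \; t \in \mathbb{R}^3, \; s > 0
```

The eight extra degrees of freedom are what allow a homography to absorb distortions that a similarity transformation cannot represent.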
SLAM Framework
VGGT-SLAM builds submaps incrementally from monocular images and keeps them globally aligned as new submaps arrive. It first selects keyframes based on disparity, then runs VGGT on each group of keyframes to construct a submap. Submaps are aligned with 15-DOF homographies that are computed directly, without a separate correspondence-estimation step. Loop closures further refine the alignment.
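The front end described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the input format (pre-tracked pixel positions), the function names, the 25-pixel threshold, and the choice to share one frame between consecutive submaps are all assumptions made here for the example.

```python
import numpy as np


def select_keyframes(tracks, min_disparity=25.0):
    """Pick keyframes by mean pixel displacement since the last keyframe.

    tracks: array of shape (T, N, 2) holding the pixel positions of the same
    N tracked points in each of T frames (a hypothetical input format).
    """
    tracks = np.asarray(tracks, dtype=float)
    keyframes = [0]  # the first frame is always a keyframe
    for t in range(1, len(tracks)):
        displacement = np.linalg.norm(tracks[t] - tracks[keyframes[-1]], axis=1)
        if displacement.mean() > min_disparity:
            keyframes.append(t)
    return keyframes


def group_into_submaps(keyframes, window=4):
    """Chunk keyframes into submaps of up to `window` frames.

    Consecutive submaps share one frame (an assumption of this sketch), so
    the same pixels appear in both and can be compared without matching.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to allow an overlap")
    submaps = []
    start = 0
    while start < len(keyframes) - 1:
        submaps.append(keyframes[start:start + window])
        start += window - 1  # step back by one frame to create the overlap
    return submaps
```

Each submap produced this way would then be passed to the reconstruction model, and the shared frame gives a natural place to compare the two reconstructions.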
Homography Estimation
The core contribution is the homography estimation, which addresses the inability of similarity transforms to resolve reconstruction ambiguities. VGGT-SLAM computes 15-DOF homographies that belong to the SL(4) manifold, so the alignment can account for shear, stretch, and perspective distortions. Each 3D point pair contributes three constraints, so five pairs determine the 15 degrees of freedom. The system uses RANSAC with a 5-point solver for robustness against inaccurate depth estimates.
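To make the estimation step concrete, here is a minimal sketch of a 4x4 homography fit wrapped in RANSAC. It assumes corresponding 3D points are already available as two (N, 3) arrays, uses a plain direct linear transform as the solver, and rescales the result to unit determinant. The function names, the inlier tolerance, and the iteration count are hypothetical; the paper's actual solver may differ.

```python
import numpy as np


def to_homogeneous(points):
    """Append a coordinate of 1 to each row of an (N, 3) array."""
    return np.hstack([points, np.ones((len(points), 1))])


def apply_homography(H, points):
    """Map (N, 3) points through a 4x4 homography and dehomogenize."""
    mapped = to_homogeneous(points) @ H.T
    return mapped[:, :3] / mapped[:, 3:4]


def fit_homography(src, dst):
    """Direct linear transform for dst ~ H @ src from at least 5 point pairs.

    Returns H scaled so that det(H) = 1, or None when the determinant is not
    positive (a 4x4 matrix with negative determinant cannot be rescaled
    into SL(4) by a real factor).
    """
    rows = []
    for X, y in zip(to_homogeneous(src), dst):
        for i in range(3):
            row = np.zeros(16)
            row[4 * i:4 * i + 4] = X   # coefficient of row i of H
            row[12:16] = -y[i] * X     # coefficient of row 4 of H
            rows.append(row)
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(4, 4)           # null vector of the constraint matrix
    det = np.linalg.det(H)
    if det <= 1e-12:
        return None
    return H / det ** 0.25


def ransac_homography(src, dst, iters=200, tol=1e-3, seed=0):
    """RANSAC over minimal 5-point samples, then a refit on all inliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=5, replace=False)
        H = fit_homography(src[idx], dst[idx])
        if H is None:
            continue
        with np.errstate(divide="ignore", invalid="ignore"):
            err = np.linalg.norm(apply_homography(H, src) - dst, axis=1)
        inliers = err < tol            # NaN errors compare as False
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 5:
        return None, best
    return fit_homography(src[best], dst[best]), best
```

Note that H and -H both have unit determinant in four dimensions, so two estimates are best compared by the points they produce rather than entry by entry.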
*Figure 1: Ablation studies. (a) Effect of loop closure (LC) on absolute trajectory error (ATE) across different window sizes, w, on the TUM RGB-D benchmark (Sturm et al., 2012).*
Backend Optimization
VGGT-SLAM uses nonlinear factor graph optimization to keep the global map consistent. The backend solves a maximum a posteriori estimation problem, using Levenberg-Marquardt to iteratively refine the homographies that align the submaps. The optimization is carried out directly on the SL(4) manifold, which the paper presents as a novel approach in SLAM.
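The idea of optimizing directly on the manifold can be illustrated with a small, self-contained sketch. It is not the paper's backend: the residual (a Frobenius difference, chosen to avoid a matrix logarithm), the finite-difference Jacobian, the damping schedule, and all function names are simplifications assumed here. What it does show is the key mechanism: each update is a 15-vector mapped through the matrix exponential of a traceless generator, so every iterate keeps unit determinant.

```python
import numpy as np


def sl4_basis():
    """15 traceless 4x4 generators spanning the Lie algebra sl(4)."""
    basis = []
    for i in range(4):
        for j in range(4):
            if i != j:
                G = np.zeros((4, 4))
                G[i, j] = 1.0
                basis.append(G)
    for i in range(3):
        G = np.zeros((4, 4))
        G[i, i], G[i + 1, i + 1] = 1.0, -1.0
        basis.append(G)
    return np.stack(basis)


BASIS = sl4_basis()


def expm(A, terms=18):
    """Matrix exponential by scaling and squaring with a Taylor series."""
    norm = np.linalg.norm(A, ord=np.inf)
    s = 0 if norm < 1 else int(np.ceil(np.log2(norm))) + 1
    scaled = A / 2 ** s
    result, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(s):
        result = result @ result
    return result


def retract(H, xi):
    """Move H along the tangent vector xi (15,) while staying on SL(4)."""
    return H @ expm(np.tensordot(xi, BASIS, axes=1))


def residuals(nodes, edges):
    """Stack H_i @ Z_ij - H_j over all edges (i, j, Z_ij)."""
    return np.concatenate([(nodes[i] @ Z - nodes[j]).ravel() for i, j, Z in edges])


def optimize(nodes, edges, iters=20, lam=1e-3, eps=1e-6):
    """Levenberg-Marquardt over all nodes except node 0, which is held fixed."""
    nodes = [H.copy() for H in nodes]
    n_free = len(nodes) - 1
    dim = 15 * n_free
    for _ in range(iters):
        r = residuals(nodes, edges)
        J = np.zeros((r.size, dim))
        for col in range(dim):          # forward-difference Jacobian
            k, a = divmod(col, 15)
            xi = np.zeros(15)
            xi[a] = eps
            pert = list(nodes)
            pert[k + 1] = retract(nodes[k + 1], xi)
            J[:, col] = (residuals(pert, edges) - r) / eps
        step = np.linalg.solve(J.T @ J + lam * np.eye(dim), -J.T @ r)
        trial = [nodes[0]] + [
            retract(nodes[k + 1], step[15 * k:15 * k + 15]) for k in range(n_free)
        ]
        if np.sum(residuals(trial, edges) ** 2) < np.sum(r ** 2):
            nodes, lam = trial, lam * 0.5   # accept step, relax damping
        else:
            lam *= 10.0                      # reject step, increase damping
    return nodes
```

Because `retract` multiplies by the exponential of a traceless matrix, whose determinant is exactly 1, no projection back onto the manifold is needed after a step. A production backend would typically use analytic Jacobians and a factor graph library instead of the dense solve shown here.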
Experimental Results
VGGT-SLAM demonstrates competitive results on the evaluation benchmarks, often surpassing or matching state-of-the-art uncalibrated SLAM systems. The system maintains accuracy in sparse mapping and dense point cloud reconstruction across various RGB-D datasets.
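Absolute trajectory error (ATE), the metric named in Figure 1, is commonly reported as an RMSE after aligning the estimated camera positions to ground truth. The sketch below assumes a similarity (scale, rotation, translation) alignment computed with Umeyama's closed-form method, which is a common convention for monocular systems; this summary does not state the exact evaluation protocol, and the function names are hypothetical.

```python
import numpy as np


def umeyama(src, dst):
    """Closed-form similarity alignment: dst ~ scale * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                 # avoid returning a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    scale = (D * np.diag(S)).sum() / var_s
    t = mu_d - scale * R @ mu_s
    return scale, R, t


def ate_rmse(estimated, ground_truth):
    """RMSE of position error after aligning estimated (N, 3) to ground truth."""
    scale, R, t = umeyama(estimated, ground_truth)
    aligned = scale * estimated @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - ground_truth) ** 2, axis=1))))
```

Including scale in the alignment matters for monocular methods, whose reconstructions have no metric scale of their own.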
Limitations and Future Work
VGGT-SLAM still has limitations. It struggles with homogeneous planar scenes, and the homography computation can become degenerate when local point errors are consistent. Future work may integrate more robust estimation techniques, such as ray-based matching, to mitigate these issues.
Conclusion
VGGT-SLAM departs from earlier SLAM formulations by optimizing on the SL(4) manifold to address projective ambiguities. Its results in dense monocular SLAM suggest that feed-forward scene reconstruction can be extended to larger scenes, and they open the way to more scalable systems that use visual information across extensive real-world environments.