VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Published 18 May 2025 in cs.CV | (2505.12549v2)

Abstract: We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.

Summary

The paper introduces VGGT-SLAM, a dense RGB SLAM system that aligns submaps by optimizing over the SL(4) manifold, resolving the projective ambiguity that arises with uncalibrated monocular cameras.
It employs 15-DOF homography estimation and nonlinear factor graph optimization to align submaps accurately and maintain global map consistency.
Experimental results demonstrate competitive performance on standard benchmarks, with future work aimed at addressing degeneracies in homogeneous planar scenes.

Introduction

The paper introduces VGGT-SLAM, a dense RGB SLAM system built on VGGT, a feed-forward scene reconstruction model. Previous methods align submaps with similarity transformations (translation, rotation, and scale). VGGT-SLAM instead optimizes over the special linear group SL(4), which accounts for the projective ambiguity inherent in reconstruction from uncalibrated monocular cameras. Because the scene is processed as a sequence of submaps, the system can reconstruct long video sequences that VGGT alone cannot handle because of its GPU memory requirements.

SLAM Framework

VGGT-SLAM constructs submaps incrementally from monocular images and aligns them both incrementally and globally. The process begins by selecting keyframes based on disparity, then runs VGGT on the keyframes to construct each submap. Sequential submaps are aligned with 15-DOF homographies that are computed directly, without a separate correspondence-estimation step. Loop closures further refine the alignment.
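To make the submap construction concrete, the following is a minimal sketch of how a list of keyframes could be grouped into windows of size `w`. The function name `split_into_submaps` and the one-frame `overlap` between consecutive submaps are assumptions made for illustration; the summary above does not specify how the windows are formed.

```python
from typing import List, Sequence


def split_into_submaps(keyframes: Sequence[int], w: int, overlap: int = 1) -> List[List[int]]:
    """Group keyframe indices into windows of at most w frames.

    Consecutive windows share `overlap` frames (an assumption of this
    sketch), so that neighbouring submaps observe some common content.
    """
    if w <= overlap:
        raise ValueError("window size w must be larger than the overlap")
    n = len(keyframes)
    if n == 0:
        return []
    step = w - overlap
    submaps = [list(keyframes[0:w])]
    start = step
    # Keep adding windows while they would contain at least one new frame.
    while start + overlap < n:
        submaps.append(list(keyframes[start:start + w]))
        start += step
    return submaps


if __name__ == "__main__":
    for submap in split_into_submaps(list(range(10)), w=4):
        print(submap)
```

With ten keyframes and `w=4`, each window starts on the last frame of the previous one, and the final window may be shorter when the sequence length does not divide evenly.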

Homography Estimation

The crux of the paper is the homography estimation, which addresses the limits of similarity transforms in dealing with reconstruction ambiguities. VGGT-SLAM calculates 15-DOF homographies that belong to the SL(4) manifold, allowing a transformation that accounts for shear, stretch, and perspective distortions. The system employs RANSAC with a 5-point solver to ensure robustness against inaccurate depth estimates.

*Figure 1: Ablation studies: (a) Effect of loop closure (LC) on absolute trajectory error (ATE) across different window sizes, $w$, in TUM RGB-D.*
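The count of five points follows from the degrees of freedom: a 4x4 homography has 16 entries defined up to scale, leaving 15 unknowns, and each 3D point correspondence contributes three linear equations. The sketch below shows one standard way to build such a solver, a direct linear transform (DLT) wrapped in a basic RANSAC loop. It is an illustration under stated assumptions, not the paper's implementation: the function names, the inlier `threshold`, and the iteration count are all hypothetical.

```python
import numpy as np


def to_homogeneous(pts):
    """Append a 1 to each 3D point, giving an (N, 4) array."""
    pts = np.asarray(pts, dtype=float)
    return np.hstack([pts, np.ones((pts.shape[0], 1))])


def apply_homography_3d(H, pts):
    """Map (N, 3) points through a 4x4 homography and dehomogenize."""
    out = to_homogeneous(pts) @ H.T
    return out[:, :3] / out[:, 3:4]


def estimate_homography_3d(src, dst):
    """Linear (DLT) estimate of H with dst ~ H @ src from N >= 5 point pairs."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    if n < 5 or src.shape != dst.shape or src.shape[1:] != (3,):
        raise ValueError("need two (N, 3) arrays with N >= 5")
    src_h = to_homogeneous(src)
    A = np.zeros((3 * n, 16))
    for i in range(n):
        for k in range(3):
            # Encodes (row k of H) . X - dst[k] * (row 3 of H) . X = 0.
            A[3 * i + k, 4 * k:4 * k + 4] = src_h[i]
            A[3 * i + k, 12:16] = -dst[i, k] * src_h[i]
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(4, 4)
    det = np.linalg.det(H)
    if abs(det) < 1e-12:
        raise ValueError("degenerate point configuration")
    # Remove the free scale so that |det(H)| = 1. A real scale factor cannot
    # flip the sign of a 4x4 determinant, so det(H) = -1 remains possible.
    return H / abs(det) ** 0.25


def ransac_homography_3d(src, dst, threshold, iterations=200, seed=0):
    """Basic RANSAC: fit on random 5-point samples, keep the largest consensus."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rng = np.random.default_rng(seed)
    best_H, best_mask = None, np.zeros(len(src), dtype=bool)
    for _ in range(iterations):
        idx = rng.choice(len(src), size=5, replace=False)
        try:
            H = estimate_homography_3d(src[idx], dst[idx])
        except ValueError:
            continue  # skip degenerate samples
        with np.errstate(all="ignore"):
            err = np.linalg.norm(apply_homography_3d(H, src) - dst, axis=1)
        mask = err < threshold
        if mask.sum() > best_mask.sum():
            best_H, best_mask = H, mask
    return best_H, best_mask
```

Calling `ransac_homography_3d(src, dst, threshold=...)` on two `(N, 3)` arrays of matched points returns the best homography found and a boolean inlier mask. A practical system would typically refit on all inliers and normalize the point coordinates before the DLT step; both are omitted here for brevity.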

Backend Optimization

VGGT-SLAM employs nonlinear factor graph optimization to keep the global map consistent. The backend solves a maximum a posteriori estimation problem, using Levenberg-Marquardt to iteratively refine the homographies that align the submaps. The optimization is carried out directly on the SL(4) manifold, a novel approach in SLAM applications.
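Optimizing on SL(4) means that every update must keep the homography at unit determinant. One common way to achieve this, sketched below, is to parameterize each update by a 15-vector in the Lie algebra sl(4) (the traceless 4x4 matrices) and apply it through the matrix exponential, since det(exp(A)) = exp(tr(A)) = 1 when tr(A) = 0. The choice of basis, the right-multiplicative update, and the names `sl4_generators`, `hat`, `expm`, and `retract` are assumptions of this sketch; the summary does not state which parameterization the paper uses.

```python
import numpy as np


def sl4_generators():
    """Return a (15, 4, 4) basis of sl(4), the traceless 4x4 matrices."""
    gens = []
    # 12 off-diagonal generators: a single 1 at position (i, j), i != j.
    for i in range(4):
        for j in range(4):
            if i != j:
                G = np.zeros((4, 4))
                G[i, j] = 1.0
                gens.append(G)
    # 3 diagonal generators: +1 at (i, i) and -1 at (i+1, i+1).
    for i in range(3):
        G = np.zeros((4, 4))
        G[i, i] = 1.0
        G[i + 1, i + 1] = -1.0
        gens.append(G)
    return np.stack(gens)


def hat(xi):
    """Map a 15-vector of coefficients to a traceless 4x4 matrix."""
    return np.tensordot(np.asarray(xi, dtype=float), sl4_generators(), axes=1)


def expm(A, terms=18):
    """Matrix exponential via scaling and squaring with a Taylor series."""
    norm = np.linalg.norm(A, 1)
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A_scaled = A / (2 ** squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A_scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def retract(H, xi):
    """Apply a tangent-space update xi to H, staying on SL(4)."""
    return H @ expm(hat(xi))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = retract(np.eye(4), 0.1 * rng.standard_normal(15))
    print(np.linalg.det(H))  # stays at 1 up to floating-point error
```

In a Levenberg-Marquardt iteration, the solver would compute a 15-dimensional step `xi` for each homography from the linearized residuals and then call `retract` to apply it, so the estimate never leaves the manifold and no separate determinant renormalization is needed.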

Experimental Results

VGGT-SLAM demonstrates competitive results on the evaluation benchmarks, often surpassing or matching state-of-the-art uncalibrated SLAM systems. The system maintains accuracy in sparse mapping and dense point cloud reconstruction across various RGB-D datasets.

Limitations and Future Work

While effective, VGGT-SLAM faces challenges, such as handling homogeneous planar scenes and potential degeneracies in homography calculations under consistent local point errors. Future improvements may involve integrating robust estimation techniques like ray-based matching to mitigate these issues.

Conclusion

VGGT-SLAM represents a significant shift in SLAM methodology, introducing optimization on the SL(4) manifold to address projective ambiguities. Its results on dense monocular SLAM point toward further advances built on feed-forward scene reconstruction, and toward more scalable systems that can exploit visual information across extensive real-world environments.