Overview of SplatPose: Geometry-Aware 6-DoF Pose Estimation
The paper presents SplatPose, a framework for six degrees of freedom (6-DoF) pose estimation from a single RGB image. It builds on advances in 3D Gaussian Splatting (3DGS) to train a dual-branch neural architecture that estimates pose without requiring depth sensors or multi-view image inputs.
Architectural Innovations
Central to SplatPose is the introduction of the Dual-Attention Ray Scoring Network (DARS-Net). DARS-Net represents a significant departure from conventional methods by decoupling positional and angular alignment, specifically aiming to reduce rotational ambiguity inherent in existing RGB-based pose estimation frameworks. By doing so, the framework achieves improved precision in estimating both translation and rotation. This is accomplished through geometry-domain attention mechanisms that explicitly model directional dependencies.
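The decoupled scoring idea can be illustrated with a minimal sketch: candidate rays contribute positional cues (origins) and angular cues (directions), and each cue set is scored against the query-image feature by its own attention head. All function names, shapes, and projection matrices below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_ray_scores(ray_origins, ray_dirs, query_feat, w_pos, w_dir):
    """Score candidate rays with separate positional and angular attention.

    ray_origins: (N, 3) candidate ray origins (translation cues)
    ray_dirs:    (N, 3) unit ray directions (rotation cues)
    query_feat:  (D,)   feature vector from the query RGB image
    w_pos, w_dir: (3, D) hypothetical projection matrices for each head
    Returns two (N,) score distributions: one for translation, one for rotation.
    """
    d = len(query_feat)
    pos_keys = ray_origins @ w_pos            # (N, D) positional keys
    dir_keys = ray_dirs @ w_dir               # (N, D) directional keys
    pos_att = softmax(pos_keys @ query_feat / np.sqrt(d))  # translation scores
    dir_att = softmax(dir_keys @ query_feat / np.sqrt(d))  # rotation scores
    return pos_att, dir_att
```

Because the two heads attend over different geometric quantities, a ray can score highly for translation while scoring poorly for rotation (or vice versa), which is the decoupling the summary describes.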
Another critical innovation is the framework's coarse-to-fine optimization pipeline. This pipeline refines pose estimates initially derived from sparse ray sampling by aligning dense 2D image features between the query image and views synthesized from the 3DGS model. By progressively correcting feature misalignment and depth errors, the refinement achieves accuracy comparable to techniques that rely on depth or multi-view setups.
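The coarse-to-fine loop can be sketched as a local search that anneals its step size, keeping the pose whose synthesized view best matches the query features. Here `render_fn` stands in for rendering features from the 3DGS model; the random-perturbation search and all names are illustrative assumptions, not the authors' optimizer.

```python
import numpy as np

def refine_pose(coarse_pose, render_fn, query_feat, n_iters=50, step=0.1, seed=0):
    """Coarse-to-fine refinement sketch.

    coarse_pose: initial pose vector from sparse ray scoring
    render_fn:   pose -> feature vector of the view synthesized at that pose
    query_feat:  feature vector extracted from the query RGB image
    Returns the refined pose and its remaining feature-alignment error.
    """
    rng = np.random.default_rng(seed)
    pose = np.asarray(coarse_pose, dtype=float)
    best_err = np.linalg.norm(render_fn(pose) - query_feat)
    for i in range(n_iters):
        scale = step * (0.5 ** (i // 10))     # anneal: coarse steps, then fine
        cand = pose + rng.normal(scale=scale, size=pose.shape)
        err = np.linalg.norm(render_fn(cand) - query_feat)
        if err < best_err:                    # keep the better-aligned pose
            pose, best_err = cand, err
    return pose, best_err
```

The annealing schedule mirrors the coarse-to-fine idea: early iterations explore widely to escape a poor initialization, while later iterations make small corrections to residual misalignment.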
Empirical Evaluation
The efficacy of SplatPose is demonstrated through rigorous testing against benchmark datasets including Mip-NeRF 360, Tanks&Temples, and 12Scenes. Notably, on the Mip-NeRF 360 dataset, SplatPose achieves significantly reduced mean angular error and mean translation error compared to existing methods like 6DGS. The framework exhibits robustness across diverse and challenging scenes, outperforming previous approaches in both structured and cluttered environments.
The results on Tanks&Temples further validate SplatPose's capabilities, achieving performance metrics that decisively surpass those of depth- and multi-view-based methods. In scenarios involving complex physical environments and varying light conditions, SplatPose maintains high-fidelity pose estimation, which is crucial for practical applications in robotics and augmented reality.
Moreover, detailed memory-usage comparisons underscore the framework's computational efficiency. SplatPose not only reduces its memory footprint considerably relative to prior methods but also offers faster inference, a key requirement for real-time applications.
Implications and Future Directions
The implications of SplatPose extend to potential advancements in real-time applications where rapid and precise pose estimation is critical. The framework’s deployment without relying on extensive data or advanced sensory equipment highlights its suitability for applications in resource-constrained settings. Its explicit geometry modeling could pave the way for more adaptable and efficient pose estimation systems in dynamic environments.
Looking forward, further developments are anticipated in integrating SplatPose with broader AI systems. Its single-RGB input makes it a promising component of visual perception pipelines in autonomous platforms and virtual interaction systems. Future research may explore its adaptability to varying computational architectures and further reductions in inference latency while maintaining high accuracy.
In sum, SplatPose makes a significant contribution to computer vision, combining precision and efficiency in pose estimation from a single RGB image. Its treatment of rotational ambiguity and feature misalignment represents a forward-looking approach to scene understanding and reconstruction.