
SurgPose: Generalisable Surgical Instrument Pose Estimation using Zero-Shot Learning and Stereo Vision

Published 16 May 2025 in cs.CV, cs.AI, cs.LG, and cs.RO | (2505.11439v1)

Abstract: Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models like the FoundationPose and SAM-6D. We advanced these models by incorporating vision-based depth estimation using the RAFT-Stereo method, for robust depth estimation in reflective and textureless environments. Additionally, we enhanced SAM-6D by replacing its instance segmentation module, Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.

Summary

Analysis of "SurgPose: Generalisable Surgical Instrument Pose Estimation using Zero-Shot Learning and Stereo Vision"

The paper presents a novel approach to surgical instrument pose estimation in Robot-assisted Minimally Invasive Surgery (RMIS) that combines zero-shot learning with stereo vision. The authors address challenges posed by traditional marker-based and supervised learning methods: the need for specialized markers, susceptibility to occlusions and reflections, and the extensive data annotation required by supervised approaches.

Key Contributions and Methodology

The paper introduces a 6 Degrees of Freedom (DoF) pose estimation pipeline employing advanced zero-shot RGB-D models like FoundationPose and SAM-6D, enhanced by the RAFT-Stereo method for depth estimation. The authors advance the SAM-6D model by substituting its instance segmentation component, the Segment Anything Model (SAM), with a fine-tuned Mask R-CNN. This modification aims to improve segmentation accuracy under conditions where instruments are occluded or reflective, typical constraints within surgical environments.

The authors propose a comprehensive methodology involving several critical stages:

  1. Stereo-Based Depth Estimation: using stereo vision combined with RAFT-Stereo for accurate depth calculation in reflective and textureless environments.
  2. Enhanced Segmentation: fine-tuning a Mask R-CNN on synthetic and real images to generate the accurate segmentation masks required for zero-shot pose estimation.
  3. Zero-Shot RGB-D Pose Estimation: comparing the modified SAM-6D against FoundationPose and other models, reporting improved accuracy and precision in challenging environments.
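The stereo geometry behind step 1 is standard: once a matcher such as RAFT-Stereo produces a dense disparity map, metric depth follows from the pinhole relation Z = f·B/d. A minimal sketch, with illustrative camera parameters not taken from the paper:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth (metres).

    Pixels with near-zero disparity have no valid depth and are set to 0.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    return np.where(
        disparity > eps,
        focal_px * baseline_m / np.maximum(disparity, eps),
        0.0,
    )

# Example: 1000 px focal length, 5 mm stereo baseline (endoscope-like scale).
depth = disparity_to_depth(
    np.array([[50.0, 100.0], [0.0, 25.0]]),
    focal_px=1000.0,
    baseline_m=0.005,
)
# 50 px disparity -> 0.1 m; zero disparity is masked to 0 (invalid).
```

The resulting depth map, paired with the RGB frame and the instrument mask, forms the RGB-D input consumed by the zero-shot pose estimators in step 3.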

Theoretical and Practical Implications

The paper delineates significant implications both theoretically and practically within surgical robotics. Theoretically, it pushes the boundaries of zero-shot learning application in complex, real-world surgical scenarios where dynamic object exploration is paramount. The proposed model facilitates robust pose estimation without needing extensive retraining, enhancing adaptability to new or unseen surgical tools.

Practically, this research sets a new benchmark for accuracy and generalisability in surgical robotics, potentially enhancing operative precision, safety, and efficacy. The pipeline's robustness against occlusions and reflective surfaces offers an adaptable solution for enhancing real-time surgical navigation and instrument control.

Validation and Results

Extensive validation was conducted using several datasets, assessing the model's performance under non-occluded and occluded scenarios. The results indicate that the enhanced SAM-6D with Mask R-CNN significantly outperforms the original version and competes strongly with FoundationPose, especially in occluded environments. This improvement highlights the critical role of accurate instance segmentation and depth estimation in securing precise instrument tracking.
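The summary does not name the evaluation metric, but a common choice for 6-DoF pose accuracy is the ADD metric: the average distance between model points transformed by the predicted and ground-truth poses. A sketch, under that assumption:

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """Average Distance of model points (ADD) between two 6-DoF poses.

    model_points: (N, 3) array of 3D points on the object model.
    Returns the mean point-to-point distance in the same units as t.
    """
    pred = model_points @ R_pred.T + t_pred
    gt = model_points @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

# Identity rotations, a 1 mm translation offset along x -> ADD = 0.001 m.
pts = np.random.default_rng(0).normal(size=(100, 3))
err = add_metric(np.eye(3), np.array([0.001, 0.0, 0.0]),
                 np.eye(3), np.zeros(3), pts)
```

A pose is then typically counted as correct when ADD falls below a threshold such as 10% of the object diameter; the specific threshold used in the paper's evaluation is not stated in this summary.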

Future Directions

This study opens several avenues for future research. Depth estimation could be refined further and more sophisticated stereo algorithms integrated. Zero-shot models could also be extended to other types of robotic surgery, or to other domains within medical robotics, offering broader solutions across robotic-assisted procedures.

In summary, this paper presents a significant stride in surgical instrument tracking, marrying zero-shot learning capabilities with stereo vision advancements to address inherent challenges in RMIS environments effectively. This adaptable pipeline promises enhanced surgical precision and control, fostering improved surgical outcomes.