- The paper presents a marker-free localisation framework that leverages stereo-RGB cameras and transformer models to achieve submillimetre accuracy in draped surgical environments.
- It employs a stereo differentiable rendering algorithm and a cut, mix, merge data augmentation strategy to tackle occlusions and enhance multi-robot segmentation, validated on a large multi-centre dataset.
- The method also integrates proprioceptive breathing motion estimation for dynamic tissue tracking, enabling effective automated pedicle screw placement in spinal surgery.
Marker-Free Proprioception for Distributed Surgical Robotic Systems
Introduction
This paper presents a comprehensive framework for marker-free localisation of surgical robots under sterile draping, addressing a critical limitation in current surgical robotics: the lack of spatial awareness in crowded, dynamic operating environments. The proposed method leverages lightweight stereo-RGB cameras and transformer-based deep learning models to achieve robust, holistic localisation of fully draped robots, eliminating the need for bulky infrared tracking systems and reflective markers. The approach is validated on the largest multi-centre spatial robotic surgery dataset to date, comprising 1.43 million self-annotated images from human cadaveric and preclinical in vivo studies. The system is demonstrated in the context of robotic spine surgery, specifically pedicle screw placement, where submillimetre precision and dynamic tissue motion tracking are essential.
Technical Contributions
Stereo Differentiable Rendering (SDR) for Robot Localisation
The core of the localisation framework is a stereo differentiable rendering (SDR) algorithm. SDR aligns virtual robot models to silhouette segmentations inferred from stereo-RGB images, optimising the robot's pose in the camera coordinate system. The optimisation objective is formulated as:
$$\Theta_{\text{left}}^{*} = \underset{\Theta_{\text{left}}}{\operatorname{arg\,min}}\; f\!\left(M_{\text{left}}(\Theta_{\text{left}}, V_l'),\, S_{\text{left}}\right) + f\!\left(M_{\text{right}}(\Theta_{\text{right}}(\Theta_{\text{left}}), V_l'),\, S_{\text{right}}\right)$$
where Mleft/right are binary silhouette renders of the robot mesh, and Sleft/right are segmentation masks. The cost function f is designed to emphasise boundary alignment and robustness to segmentation inaccuracies, using either MSE with Euclidean distance transforms or a Dice loss variant with exponential decay for gradient smoothing.
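The two cost variants can be sketched in NumPy as follows. This is a minimal illustration, not the paper's differentiable implementation: a crude two-pass L1 distance transform stands in for the Euclidean one, and both functions operate on plain arrays rather than rendered silhouettes.

```python
import numpy as np

def manhattan_edt(mask):
    """Two-pass Manhattan (L1) distance transform to the nearest
    foreground pixel; a cheap stand-in for a Euclidean transform."""
    h, w = mask.shape
    d = np.where(mask > 0, 0, h + w).astype(int)
    for i in range(h):            # forward pass
        for j in range(w):
            if i > 0: d[i, j] = min(d[i, j], d[i - 1, j] + 1)
            if j > 0: d[i, j] = min(d[i, j], d[i, j - 1] + 1)
    for i in range(h - 1, -1, -1):  # backward pass
        for j in range(w - 1, -1, -1):
            if i < h - 1: d[i, j] = min(d[i, j], d[i + 1, j] + 1)
            if j < w - 1: d[i, j] = min(d[i, j], d[i, j + 1] + 1)
    return d

def boundary_weighted_mse(render, mask):
    """MSE between a rendered silhouette and a segmentation mask,
    weighted by distance-transform maps so that disagreements far
    from the mask contour cost more (illustrative weighting)."""
    weight = manhattan_edt(mask) + manhattan_edt(1 - mask)
    return float(np.mean(((render - mask) ** 2) * weight))

def dice_loss(render, mask, eps=1e-6):
    """Soft Dice loss on (possibly soft) silhouettes."""
    inter = np.sum(render * mask)
    return 1.0 - (2.0 * inter + eps) / (np.sum(render) + np.sum(mask) + eps)
```

Both losses are zero (or near zero) for a perfectly aligned render and grow as the silhouettes drift apart, which is the property the pose optimisation exploits.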
Camera pose initialisation is achieved via a camera swarm optimisation (CSO) algorithm, a parallelised particle swarm approach that samples virtual camera poses within a hollow sphere about the robot base, enabling robust convergence in high-dimensional pose space.
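A minimal particle-swarm sketch of the CSO idea — candidate camera positions sampled in a hollow sphere about the robot base, then refined by standard swarm updates — might look like this. The cost callable stands in for the silhouette alignment objective, and all parameter values (radii, swarm coefficients) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hollow_sphere(n, r_min, r_max):
    """Uniformly sample n points in a spherical shell about the origin
    (the robot base), the search region for camera initialisation."""
    dirs = rng.normal(size=(n, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Uniform in volume: radius drawn as cbrt of a uniform over [r_min^3, r_max^3]
    r = np.cbrt(rng.uniform(r_min**3, r_max**3, size=n))
    return dirs * r[:, None]

def swarm_optimise(cost, n=64, iters=100, r_min=1.5, r_max=3.5,
                   w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm over candidate camera positions.
    `cost` maps a position to a scalar (in the paper, a silhouette
    alignment cost; here any callable works)."""
    x = sample_hollow_sphere(n, r_min, r_max)
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([cost(p) for p in x])
    g = pbest[np.argmin(pbest_f)]
    for _ in range(iters):
        r1, r2 = rng.random((n, 1)), rng.random((n, 1))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        f = np.array([cost(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmin(pbest_f)]
    return g
```

Because many particles are evaluated per iteration, this search parallelises naturally across GPU renders, which is what makes the high-dimensional initialisation tractable.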
Drape- and Occlusion-Invariant Segmentation
Accurate segmentation of draped robots is achieved through supervised training on self-annotated datasets, where ground-truth segmentations are generated by kinematic tracking and rendering. The segmentation pipeline employs UNet architectures with transformer-based (MIT-B3/B5) or convolutional (ResNet-101) encoders. A novel cut, mix, merge (CMM) augmentation simulates multi-robot setups, enhancing generalisation to occlusions and variable robot counts. The best MIT-B5-based model achieves an IoU of 0.73±0.06 on challenging multi-robot test data, outperforming the SAM 2 foundation model (0.60±0.03 IoU) with one-fifth the parameter count and no user interaction.
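The cut, mix, merge idea can be illustrated with a small compositing sketch: a robot is cut from one annotated image via its mask, shifted, and merged over another so the pasted robot occludes the original. The exact augmentation details (shift handling, label scheme) are assumptions here.

```python
import numpy as np

def cut_mix_merge(img_a, mask_a, img_b, mask_b, shift=(0, 0)):
    """Sketch of a cut-mix-merge style augmentation: cut the robot
    from image B using its binary mask, shift it, and merge it over
    image A so the pasted robot occludes the original one. Returns
    the composite image and a multi-robot label mask."""
    cut_img = np.roll(img_b * mask_b[..., None], shift, axis=(0, 1))
    cut_msk = np.roll(mask_b, shift, axis=(0, 1))
    out_img = np.where(cut_msk[..., None] > 0, cut_img, img_a)
    out_msk = np.where(cut_msk > 0, 2, mask_a)  # label 2: pasted robot B
    return out_img, out_msk
```

Training on such composites exposes the model to mutual occlusion and variable robot counts without collecting genuine multi-robot recordings.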
In-Context Priors and Iterative Refinement (SDRICP)
To further improve localisation under severe occlusion, the framework introduces in-context priors: silhouette renders of the robot at the current best pose estimate are appended as a fourth input channel to the segmentation model. This enables iterative refinement (SDRICP), which alternates between segmentation and pose optimisation and yields average improvements of 26% in localisation accuracy and 46% in error range.
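The alternation reduces to a short loop; in this sketch, `segment`, `render`, and `optimise_pose` are placeholders for the paper's segmentation model, silhouette renderer, and stereo differentiable rendering step.

```python
def sdr_icp(image_lr, segment, render, optimise_pose, pose0, iters=3):
    """Sketch of the SDRICP loop: alternate between segmentation
    (conditioned on a rendered prior of the current pose estimate)
    and pose optimisation against the refreshed masks."""
    pose = pose0
    for _ in range(iters):
        prior = render(pose)              # silhouette at current estimate
        masks = segment(image_lr, prior)  # prior appended as 4th channel
        pose = optimise_pose(pose, masks) # stereo differentiable rendering
    return pose
```

Each pass gives the segmentation model a stronger spatial prior, which in turn yields cleaner masks for the next pose fit, so errors shrink over iterations rather than compounding.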
Proprioceptive Breathing Motion Estimation
The system enables proprioceptive estimation of tissue dynamics by tracking AprilTag markers affixed to the anatomy via stereo-RGB triangulation. This approach provides a holistic, physiologically accurate view of respiratory motion, with 25% higher visibility than infrared markers and submillimetre accuracy. The breathing motion is modelled as a truncated Fourier series and used for real-time compensation during drilling, with control implemented as a quadratic optimisation problem in joint velocity space.
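Fitting a truncated Fourier series to a marker displacement trace reduces to linear least squares once the breathing frequency is known; the following sketch assumes a known fundamental frequency `f0` and uses illustrative function names not taken from the paper.

```python
import numpy as np

def fit_breathing_model(t, z, f0, n_harmonics=3):
    """Least-squares fit of a truncated Fourier series
        z(t) ≈ a0 + sum_k [a_k cos(2π k f0 t) + b_k sin(2π k f0 t)]
    to a marker displacement trace z sampled at times t.
    Returns the coefficients and a predictor for new times."""
    def design(tq):
        cols = [np.ones_like(tq)]
        for k in range(1, n_harmonics + 1):
            cols += [np.cos(2 * np.pi * k * f0 * tq),
                     np.sin(2 * np.pi * k * f0 * tq)]
        return np.stack(cols, axis=1)

    coef, *_ = np.linalg.lstsq(design(t), z, rcond=None)

    def predict(tq):
        return design(tq) @ coef

    return coef, predict
```

Evaluating the fitted model slightly ahead of the current time gives the anticipated tissue displacement, which a real-time controller (in the paper, a quadratic programme in joint velocity space) can feed forward during drilling.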
Experimental Validation and Results
Multi-Centre Dataset and Benchmarking
The framework is validated across four European centres, with both preclinical and clinical workflow conditions. In robotic spine surgery, the system achieves sub-percent localisation accuracy at the robot base (<0.16% / <4mm error at 2.6m distance) and outperforms marker-based methods at the tool centre even with unknown tool geometry. On the draped mock spine surgery benchmark, SDRICP achieves median errors of 1.33mm (draped) vs. 0.97mm (undraped), with only submillimetre error increase due to draping.
Clinical Relevance: Breathing-Compensated Drilling
During in vivo porcine spine surgery, the system enables automated pedicle screw placement with visual breathing compensation. Stereo-RGB-based tracking reveals tissue motions and deformations invisible to conventional infrared systems, with 250% greater deformation observed in the drilling direction and physiologically consistent cranial displacement. Marker visibility is improved by 25%, and the system achieves a 13-fold reduction in camera weight and threefold reduction in size compared to infrared setups.
Segmentation and Localisation in Multi-Robot Setups
The CMM augmentation and MIT-B5-based models generalise segmentation to multi-robot scenes, with SDRICP achieving repeatability errors of 0.39cm at the robot base and 1.66cm at the end effector (<0.65% at 2.6m distance), sufficient for safe collision avoidance and context-aware control.
Limitations and Implementation Considerations
- Segmentation Accuracy: Drape-invariant segmentation remains challenging, with best IoU below 0.8. Input resolution is limited to 512×512 due to computational constraints.
- Model Specificity: Supervised training induces strong dependence on the robot model; deployment across platforms requires retraining or adaptation.
- Multi-Robot Localisation: Current localisation is per-robot, with the number of robots assumed known a priori.
- Initialisation and Calibration: Reliable convergence depends on robust initialisation (CSO) and access to joint sensor readings, which may not be available on all platforms.
- Clinical Workflow: The method assumes a static setup; intraoperative rearrangement requires recalibration. Sterility maintenance for bedside cameras is an open challenge.
Implications and Future Directions
The presented framework advances surgical scene understanding by enabling marker-free, holistic localisation of distributed robotic systems under realistic clinical constraints. The elimination of markers and reduction in hardware burden facilitate modular, flexible deployment in crowded operating theatres. The ability to track tissue dynamics and robot interactions in real time opens avenues for intelligent multi-robot coordination, context-aware control, and autonomous surgical actions.
Future research should focus on:
- Generalisation Across Platforms: Leveraging self-identifying robots and differentiable rendering to eliminate dependence on known robot models and tool geometries.
- Dynamic Scene Localisation: Adapting SDR to dynamic surgical scenes and continuous calibration, potentially obviating explicit calibration steps.
- Unified Optimisation: Developing joint optimisation schemes for camera pose and robot configuration to enhance robustness and simplicity.
- Clinical Integration: Addressing sterility, workflow flexibility, and regulatory validation for deployment in safety-critical autonomous tasks.
Conclusion
This work demonstrates that marker-free proprioception for fully draped surgical robots is feasible and clinically relevant, achieving robust localisation and tissue motion tracking with lightweight, low-cost hardware. The framework supports modular, distributed robotic systems and lays the foundation for advanced autonomous capabilities in surgical robotics, with direct implications for safety, efficiency, and intelligent multi-robot interaction. The methods and datasets released provide a valuable resource for further research and development in the field.