- The paper presents a marker-free localisation framework that leverages stereo-RGB cameras and transformer models to achieve submillimetre accuracy in draped surgical environments.
- It employs a stereo differentiable rendering algorithm and a cut, mix, merge data augmentation strategy to tackle occlusions and enhance multi-robot segmentation, validated on a large multi-centre dataset.
- The method also integrates proprioceptive breathing motion estimation for dynamic tissue tracking, enabling effective automated pedicle screw placement in spinal surgery.
Marker-Free Proprioception for Distributed Surgical Robotic Systems
Introduction
This paper presents a comprehensive framework for marker-free localisation of surgical robots under sterile draping, addressing a critical limitation in current surgical robotics: the lack of spatial awareness in crowded, dynamic operating environments. The proposed method leverages lightweight stereo-RGB cameras and transformer-based deep learning models to achieve robust, holistic localisation of fully draped robots, eliminating the need for bulky infrared tracking systems and reflective markers. The approach is validated on the largest multi-centre spatial robotic surgery dataset to date, comprising 1.43 million self-annotated images from human cadaveric and preclinical in vivo studies. The system is demonstrated in the context of robotic spine surgery, specifically pedicle screw placement, where submillimetre precision and dynamic tissue motion tracking are essential.
Technical Contributions
Stereo Differentiable Rendering (SDR) for Robot Localisation
The core of the localisation framework is a stereo differentiable rendering (SDR) algorithm. SDR aligns virtual robot models to silhouette segmentations inferred from stereo-RGB images, optimising the robot's pose in the camera coordinate system. The optimisation objective is formulated as:
$$\Theta_{\text{left}}^{*} = \underset{\Theta_{\text{left}}}{\operatorname{arg\,min}}\; f\!\left(M_{\text{left}}(\Theta_{\text{left}}, V_l'),\, S_{\text{left}}\right) + f\!\left(M_{\text{right}}(\Theta_{\text{right}}(\Theta_{\text{left}}), V_l'),\, S_{\text{right}}\right)$$
where Mleft/right are binary silhouette renders of the robot mesh, and Sleft/right are segmentation masks. The cost function f is designed to emphasise boundary alignment and robustness to segmentation inaccuracies, using either MSE with Euclidean distance transforms or a Dice loss variant with exponential decay for gradient smoothing.
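The two cost variants can be sketched in NumPy as follows. This is a minimal illustration, not the paper's differentiable implementation: a crude two-pass L1 distance transform stands in for the Euclidean one, and both functions operate on plain arrays rather than rendered silhouettes.

```python
import numpy as np

def manhattan_edt(mask):
    """Two-pass Manhattan (L1) distance transform to the nearest
    foreground pixel; a cheap stand-in for a Euclidean transform."""
    h, w = mask.shape
    d = np.where(mask > 0, 0, h + w).astype(int)
    for i in range(h):            # forward pass
        for j in range(w):
            if i > 0: d[i, j] = min(d[i, j], d[i - 1, j] + 1)
            if j > 0: d[i, j] = min(d[i, j], d[i, j - 1] + 1)
    for i in range(h - 1, -1, -1):  # backward pass
        for j in range(w - 1, -1, -1):
            if i < h - 1: d[i, j] = min(d[i, j], d[i + 1, j] + 1)
            if j < w - 1: d[i, j] = min(d[i, j], d[i, j + 1] + 1)
    return d

def boundary_weighted_mse(render, mask):
    """MSE between a rendered silhouette and a segmentation mask,
    weighted by distance-transform maps so that disagreements far
    from the mask contour cost more (illustrative weighting)."""
    weight = manhattan_edt(mask) + manhattan_edt(1 - mask)
    return float(np.mean(((render - mask) ** 2) * weight))

def dice_loss(render, mask, eps=1e-6):
    """Soft Dice loss on (possibly soft) silhouettes."""
    inter = np.sum(render * mask)
    return 1.0 - (2.0 * inter + eps) / (np.sum(render) + np.sum(mask) + eps)
```

Both losses are zero (or near zero) for a perfectly aligned render and grow as the silhouettes drift apart, which is the property the pose optimisation exploits.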
Camera pose initialisation is achieved via a camera swarm optimisation (CSO) algorithm, a parallelised particle swarm approach that samples virtual camera poses within a hollow sphere about the robot base, enabling robust convergence in high-dimensional pose space.
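A minimal particle-swarm sketch of the CSO idea — candidate camera positions sampled in a hollow sphere about the robot base, then refined by standard swarm updates — might look like this. The cost callable stands in for the silhouette alignment objective, and all parameter values (radii, swarm coefficients) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hollow_sphere(n, r_min, r_max):
    """Uniformly sample n points in a spherical shell about the origin
    (the robot base), the search region for camera initialisation."""
    dirs = rng.normal(size=(n, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Uniform in volume: radius drawn as cbrt of a uniform over [r_min^3, r_max^3]
    r = np.cbrt(rng.uniform(r_min**3, r_max**3, size=n))
    return dirs * r[:, None]

def swarm_optimise(cost, n=64, iters=100, r_min=1.5, r_max=3.5,
                   w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm over candidate camera positions.
    `cost` maps a position to a scalar (in the paper, a silhouette
    alignment cost; here any callable works)."""
    x = sample_hollow_sphere(n, r_min, r_max)
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([cost(p) for p in x])
    g = pbest[np.argmin(pbest_f)]
    for _ in range(iters):
        r1, r2 = rng.random((n, 1)), rng.random((n, 1))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        f = np.array([cost(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmin(pbest_f)]
    return g
```

Because many particles are evaluated per iteration, this search parallelises naturally across GPU renders, which is what makes the high-dimensional initialisation tractable.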
Drape- and Occlusion-Invariant Segmentation
Accurate segmentation of draped robots is achieved through supervised training on self-annotated datasets, where ground-truth segmentations are generated by kinematic tracking and rendering. The segmentation pipeline employs UNet architectures with transformer-based (MIT-B3/B5) or convolutional (ResNet-101) encoders. A novel cut, mix, merge (CMM) augmentation simulates multi-robot setups, enhancing generalisation to occlusions and variable robot counts. The best MIT-B5-based model achieves an IoU of 0.73±0.06 on challenging multi-robot test data, outperforming the SAM 2 foundation model (0.60±0.03 IoU) with one-fifth the parameter count and no user interaction.
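The cut, mix, merge idea can be illustrated with a small compositing sketch: a robot is cut from one annotated image via its mask, shifted, and merged over another so the pasted robot occludes the original. The exact augmentation details (shift handling, label scheme) are assumptions here.

```python
import numpy as np

def cut_mix_merge(img_a, mask_a, img_b, mask_b, shift=(0, 0)):
    """Sketch of a cut-mix-merge style augmentation: cut the robot
    from image B using its binary mask, shift it, and merge it over
    image A so the pasted robot occludes the original one. Returns
    the composite image and a multi-robot label mask."""
    cut_img = np.roll(img_b * mask_b[..., None], shift, axis=(0, 1))
    cut_msk = np.roll(mask_b, shift, axis=(0, 1))
    out_img = np.where(cut_msk[..., None] > 0, cut_img, img_a)
    out_msk = np.where(cut_msk > 0, 2, mask_a)  # label 2: pasted robot B
    return out_img, out_msk
```

Training on such composites exposes the model to mutual occlusion and variable robot counts without collecting genuine multi-robot recordings.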
In-Context Priors and Iterative Refinement (SDRICP)
To further improve localisation under severe occlusion, the framework introduces in-context priors: silhouette renders of the robot at the current best pose estimate are appended as a fourth input channel to the segmentation model. This enables iterative refinement (SDRICP), which alternates between segmentation and pose optimisation and yields average improvements of 26% in localisation accuracy and 46% in error range.
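The alternation reduces to a short loop; in this sketch, `segment`, `render`, and `optimise_pose` are placeholders for the paper's segmentation model, silhouette renderer, and stereo differentiable rendering step.

```python
def sdr_icp(image_lr, segment, render, optimise_pose, pose0, iters=3):
    """Sketch of the SDRICP loop: alternate between segmentation
    (conditioned on a rendered prior of the current pose estimate)
    and pose optimisation against the refreshed masks."""
    pose = pose0
    for _ in range(iters):
        prior = render(pose)              # silhouette at current estimate
        masks = segment(image_lr, prior)  # prior appended as 4th channel
        pose = optimise_pose(pose, masks) # stereo differentiable rendering
    return pose
```

Each pass gives the segmentation model a stronger spatial prior, which in turn yields cleaner masks for the next pose fit, so errors shrink over iterations rather than compounding.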
Proprioceptive Breathing Motion Estimation
The system enables proprioceptive estimation of tissue dynamics by tracking AprilTag markers affixed to the anatomy via stereo-RGB triangulation. This approach provides a holistic, physiologically accurate view of respiratory motion, with 25% higher visibility than infrared markers and submillimetre accuracy. The breathing motion is modelled as a truncated Fourier series and used for real-time compensation during drilling, with control implemented as a quadratic optimisation problem in joint velocity space.
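Fitting a truncated Fourier series to a marker displacement trace reduces to linear least squares once the breathing frequency is known; the following sketch assumes a known fundamental frequency `f0` and uses illustrative function names not taken from the paper.

```python
import numpy as np

def fit_breathing_model(t, z, f0, n_harmonics=3):
    """Least-squares fit of a truncated Fourier series
        z(t) ≈ a0 + sum_k [a_k cos(2π k f0 t) + b_k sin(2π k f0 t)]
    to a marker displacement trace z sampled at times t.
    Returns the coefficients and a predictor for new times."""
    def design(tq):
        cols = [np.ones_like(tq)]
        for k in range(1, n_harmonics + 1):
            cols += [np.cos(2 * np.pi * k * f0 * tq),
                     np.sin(2 * np.pi * k * f0 * tq)]
        return np.stack(cols, axis=1)

    coef, *_ = np.linalg.lstsq(design(t), z, rcond=None)

    def predict(tq):
        return design(tq) @ coef

    return coef, predict
```

Evaluating the fitted model slightly ahead of the current time gives the anticipated tissue displacement, which a real-time controller (in the paper, a quadratic programme in joint velocity space) can feed forward during drilling.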
Experimental Validation and Results
Multi-Centre Dataset and Benchmarking
The framework is validated across four European centres, with both preclinical and clinical workflow conditions. In robotic spine surgery, the system achieves sub-percent localisation accuracy at the robot base (<0.16% / <4mm error at 2.6m distance) and outperforms marker-based methods at the tool centre even with unknown tool geometry. On the draped mock spine surgery benchmark, SDRICP achieves median errors of 1.33mm (draped) vs. 0.97mm (undraped), with only submillimetre error increase due to draping.
Clinical Relevance: Breathing-Compensated Drilling
During in vivo porcine spine surgery, the system enables automated pedicle screw placement with visual breathing compensation. Stereo-RGB-based tracking reveals tissue motions and deformations invisible to conventional infrared systems, with 250% greater deformation observed in the drilling direction and physiologically consistent cranial displacement. Marker visibility is improved by 25%, and the system achieves a 13-fold reduction in camera weight and threefold reduction in size compared to infrared setups.
Segmentation and Localisation in Multi-Robot Setups
The CMM augmentation and MIT-B5-based models generalise segmentation to multi-robot scenes, with SDRICP achieving repeatability errors of 0.39cm at the robot base and 1.66cm at the end effector (<0.65% at 2.6m distance), sufficient for safe collision avoidance and context-aware control.
Limitations and Implementation Considerations
- Segmentation Accuracy: Drape-invariant segmentation remains challenging, with best IoU below 0.8. Input resolution is limited to 512×512 due to computational constraints.
- Model Specificity: Supervised training induces strong dependence on the robot model; deployment across platforms requires retraining or adaptation.
- Multi-Robot Localisation: Current localisation is per-robot, with the number of robots assumed known a priori.
- Initialisation and Calibration: Reliable convergence depends on robust initialisation (CSO) and access to joint sensor readings, which may not be available on all platforms.
- Clinical Workflow: The method assumes a static setup; intraoperative rearrangement requires recalibration. Sterility maintenance for bedside cameras is an open challenge.
Implications and Future Directions
The presented framework advances surgical scene understanding by enabling marker-free, holistic localisation of distributed robotic systems under realistic clinical constraints. The elimination of markers and reduction in hardware burden facilitate modular, flexible deployment in crowded operating theatres. The ability to track tissue dynamics and robot interactions in real time opens avenues for intelligent multi-robot coordination, context-aware control, and autonomous surgical actions.
Future research should focus on:
- Generalisation Across Platforms: Leveraging self-identifying robots and differentiable rendering to eliminate dependence on known robot models and tool geometries.
- Dynamic Scene Localisation: Adapting SDR to dynamic surgical scenes and continuous calibration, potentially obviating explicit calibration steps.
- Unified Optimisation: Developing joint optimisation schemes for camera pose and robot configuration to enhance robustness and simplicity.
- Clinical Integration: Addressing sterility, workflow flexibility, and regulatory validation for deployment in safety-critical autonomous tasks.
Conclusion
This work demonstrates that marker-free proprioception for fully draped surgical robots is feasible and clinically relevant, achieving robust localisation and tissue motion tracking with lightweight, low-cost hardware. The framework supports modular, distributed robotic systems and lays the foundation for advanced autonomous capabilities in surgical robotics, with direct implications for safety, efficiency, and intelligent multi-robot interaction. The methods and datasets released provide a valuable resource for further research and development in the field.