
Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge

Published 22 Jul 2025 in cs.CV | (2507.16559v1)

Abstract: Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical context - such as the current procedural phase - has emerged as a promising strategy to improve robustness and interpretability. To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures. We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in RAMIS and offers a high-quality resource to support future research in surgical scene understanding.

Summary

  • The paper demonstrates that transformer-based models integrating temporal information significantly improve surgical phase recognition and instrument segmentation performance.
  • It reveals that incorporating contextual data from a novel multi-center endoscopy dataset enhances instrument keypoint estimation and procedural understanding.
  • The challenge outcomes underscore the need for robust, generalizable models to bridge gaps in real-world surgical assistance applications.

Overview

The paper "Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge" (2507.16559) presents outcomes from the PhaKIR 2024 challenge, which focuses on three fundamental tasks in robot-assisted minimally invasive surgery (RAMIS): surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. The challenge was held as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024 and introduces a novel dataset comprising thirteen full-length laparoscopic cholecystectomy videos from three different medical centers.

Figure 1: Visualization of the PhaKIR tasks and annotations for each of the three medical centers. Displays surgical phase recognition phases, color-encoded masks for instrument instance segmentation, and keypoint coordinates for instrument keypoint estimation.

Dataset and Methodology

The PhaKIR challenge dataset includes thirteen full-length videos from three medical centers, each annotated for all three tasks. It is designed to overcome limitations of previous datasets, primarily by enabling joint evaluation of instrument localization and procedural context within the same data and by supporting the development of temporally aware methods that use information from entire procedures.

The challenge results indicate that embedding contextual and temporal information into models benefits surgical scene understanding. In particular, transformer-based architectures and temporal modeling strategies outperformed traditional convolutional models, suggesting a shift in preferred architectures within this domain.

Key Findings

  1. Surgical Phase Recognition: All top teams employed transformer models with temporal information and significantly outperformed architectures relying solely on frame-wise processing, in terms of both F1-score and Balanced Accuracy (BA).
  2. Instrument Instance Segmentation: Transformer-based methods such as Mask2Former achieved the best results. Distinguishing between individual instrument instances remains a major hurdle, and many submissions applied semantic segmentation instead.
  3. Instrument Keypoint Estimation: Although this task received less attention (only two teams participated), the results underline its complexity and importance. The main difficulty was localizing keypoints accurately across diverse instrument types, and the submitted methods varied in effectiveness.

    Figure 2: Number of participants registered and submitted for each of the eight individual EndoVis-2024 sub-challenges, illustrating participation dynamics.
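The two phase recognition metrics named above can be illustrated with a minimal sketch. It assumes frame-wise labels given as plain lists and uses the multi-class form of Balanced Accuracy (mean of per-class recall); the function names and toy labels are hypothetical, and the challenge's own aggregation across videos is not reproduced here.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} for every class present in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in support.items()) / len(support)


# Toy frame-wise labels (assumed data) using phase abbreviations from the paper.
truth = ["P", "P", "CTD", "CTD", "CTD", "CTD", "ClCu", "ClCu"]
pred = ["P", "CTD", "CTD", "CTD", "CTD", "CTD", "ClCu", "P"]
f1 = per_class_f1(truth, pred)
ba = balanced_accuracy(truth, pred)
print(f1, ba)
```

Because Balanced Accuracy averages recall per class, a short phase such as ClCu counts as much as a long phase such as CTD, which plain accuracy would not reflect.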

Implications and Future Directions

The PhaKIR challenge marks progress toward context-aware surgical video analysis tools and contributes a benchmark for developing robust systems that can be deployed in real-world settings. Future research should concentrate on improving generalizability across diverse and unseen surgical settings, on stronger solutions for keypoint estimation, and on the role of temporal features in segmenting and localizing surgical tools.

Moreover, temporal integration proved clearly superior in phase recognition, which suggests underutilized potential in the other tasks, keypoint estimation and instrument segmentation.
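To make the contrast between frame-wise and temporally aware prediction concrete, the following is a deliberately simple sketch, not a method used by any challenge team: a sliding-window majority vote over frame-wise phase predictions. The function name, window size, and toy sequence are assumptions made for illustration.

```python
from collections import Counter


def smooth_phases(frame_preds, window=5):
    """Replace each label by the most common label in a centred window."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_preds)):
        lo = max(0, i - half)
        hi = min(len(frame_preds), i + half + 1)
        counts = Counter(frame_preds[lo:hi])
        best = max(counts.values())
        # Keep the original label when it ties for the majority.
        if counts[frame_preds[i]] == best:
            smoothed.append(frame_preds[i])
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


# Toy frame-wise predictions with two isolated outliers (assumed data).
raw = ["P", "P", "P", "GD", "P", "P", "P",
       "CTD", "CTD", "CTD", "P", "CTD", "CTD", "CTD"]
smoothed = smooth_phases(raw, window=5)
print(smoothed)
```

The window looks at future frames, so this variant only applies to offline analysis. The transformer-based temporal models described in the paper learn such context rather than applying a fixed rule.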

Conclusion

The PhaKIR 2024 challenge provides pivotal groundwork and an extensive dataset for future exploration in RAMIS contexts. By addressing generalizability and exploiting temporal information, ongoing efforts seek to bridge current gaps between sophisticated machine learning models and practical surgical assistance applications.

This research is generously supported by several institutions and aligned with the community's ongoing efforts to enhance surgical practice through advanced AI integration.

Glossary

  • 95% Hausdorff Distance (95% HD): A robust boundary-based metric measuring the 95th percentile of shortest distances between contours of predicted and ground-truth segmentations. "the 95\% Hausdorff-Distance (95\% HD, \cite{huttenlocher1993comparing}) served as the boundary-based metric."
  • Balanced Accuracy (BA): In the binary case, the unweighted average of sensitivity and specificity; in the multi-class case, the unweighted mean of per-class recall. Useful under class imbalance. "Balanced Accuracy (BA,~\cite{tharwat2021classification}) served as the metric for evaluating the overall multi-class performance."
  • BIAS guidelines: Reporting standards for biomedical image analysis challenges to ensure transparent and fair evaluation. "We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges."
  • bootstrapping: A resampling technique with replacement used to estimate metric stability and confidence. "we applied bootstrapping~\cite{efron1992bootstrap} with 10,000 iterations."
  • calot triangle dissection (CTD): A defined surgical phase in cholecystectomy involving dissection of Calot’s triangle. "calot triangle dissection (CTD)"
  • Cholec80 dataset: A benchmark dataset of cholecystectomy videos annotated with surgical phases. "following the classification scheme introduced in the Cholec80 dataset by~\cite{twinanda2016endonet}."
  • cholecystectomy: Surgical removal of the gallbladder. "thirteen full-length laparoscopic cholecystectomy videos"
  • cleaning and coagulation (ClCo): A surgical phase focusing on hemostasis and cleanup. "cleaning and coagulation (ClCo)"
  • clipping and cutting (ClCu): A surgical phase where structures are clipped and transected. "clipping and cutting (ClCu)"
  • COCO evaluation protocol: A standardized evaluation scheme (for detection/segmentation/keypoints) computing mAP across multiple thresholds. "The mean Average Precision (mAP) is computed following the COCO evaluation protocol~\cite{lin2014microsoft}"
  • Computer Vision Annotation Tool (CVAT): An annotation tool for images and videos used to create segmentation and keypoint labels. "the Computer Vision Annotation Tool (CVAT)\footnote{\url{https://www.cvat.ai/}} was used."
  • Dice Similarity Coefficient (DSC): An overlap metric for segmentation accuracy ranging from 0 (no overlap) to 1 (perfect overlap). "the Dice Similarity Coefficient (DSC, \cite{dice1945measures}) was used as the multi-instance multi-class overlap metric"
  • EndoVis (Endoscopic Vision Challenge): A MICCAI-affiliated challenge series focused on endoscopic image analysis. "the Endoscopic Vision (EndoVis) challenge at MICCAI 2024."
  • ex-vivo: Data or procedures performed outside a living organism. "sequences featuring robotic instruments are based on ex-vivo data from animal tissue"
  • F1-score: The harmonic mean of precision and recall, balancing false positives and false negatives. "the F1-score~\hbox{\cite{van1979information,chinchor1992muc4}} was used as the per-class performance metric"
  • gallbladder dissection (GD): A surgical phase involving dissection of the gallbladder from its bed. "gallbladder dissection (GD)"
  • gallbladder packaging (GP): A surgical phase where the gallbladder is bagged for extraction. "gallbladder packaging (GP)"
  • gallbladder retraction (GR): A surgical phase involving retraction to expose the operative field. "gallbladder retraction (GR)"
  • HeiChole Benchmark: A benchmark from the Surgical Workflow and Skill Analysis Challenge used for phase recognition research. "as part of the Surgical Workflow and Skill Analysis Challenge (HeiChole Benchmark, \cite{wagner2019comparative})"
  • Hungarian Maximum Matching Algorithm: An algorithm for optimal assignment used to match predicted and ground-truth instances. "we applied the Hungarian Maximum Matching Algorithm~\cite{kuhn1955hungarian} based on the IoU between all possible prediction-ground truth pairs."
  • in-vivo: Data or procedures performed within a living organism. "for in-vivo recordings involving manual instruments"
  • Instrument instance segmentation: Pixel-precise multi-class, multi-instance segmentation of instruments in images. "Instrument instance segmentation: Segmentation of the contours of the surgical instruments as precisely as possible through pixel-accurate predictions and distinguishing different instrument classes and instances of the same instrument class."
  • Instrument keypoint estimation: Localizing predefined keypoints on surgical instruments, accounting for visibility and instrument type. "Instrument keypoint estimation: Location of certain keypoints on the surgical instruments as precisely as possible through pixel-accurate coordinates, and distinguishing different instrument classes and instances of the same instrument class, considering that the number of keypoints depends on the type of the individual instrument."
  • Intersection-over-Union (IoU): An overlap ratio between predicted and ground-truth regions used for instance matching. "all Intersection-over-Union (IoU) thresholds $T = \{0.50, 0.55, \ldots, 0.95\}$:"
  • mean Average Precision (mAP): The mean of average precision values across classes and thresholds, summarizing detection/segmentation performance. "The mean Average Precision (mAP) is computed following the COCO evaluation protocol~\cite{lin2014microsoft}"
  • MICCAI: The Medical Image Computing and Computer Assisted Intervention conference/society. "the annual conference of the Medical Image Computing and Computer Assisted Intervention (MICCAI) society"
  • Metrics Reloaded Framework: A framework of recommendations for selecting appropriate metrics in medical image analysis. "followed the recommendations of the Metrics Reloaded Framework proposed by~\cite{maier2024metrics}."
  • Object Keypoint Similarity (OKS): A keypoint similarity measure that accounts for object scale and keypoint visibility. "the $\text{mAP}_{\text{OKS}}$ metric based on the Object Keypoint Similarity (OKS) following the COCO evaluation protocol~\cite{cocoKeypointEval}."
  • PhaKIR: The Surgical Procedure Phase, Keypoint, and Instrument Recognition sub-challenge at EndoVis 2024. "Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge"
  • porcine tissue: Pig tissue used in experimental datasets, often simplifying instrument recognition vs. human tissue. "operate on porcine tissue~\cite{allan2017robotic, allan2018robotic}, which significantly simplifies the recognition of surgical instruments compared to human tissue"
  • preparation (P): The initial surgical phase preparing the operative field. "preparation (P)"
  • RAMIS: Robot-assisted minimally invasive surgery. "robot-assisted minimally invasive surgery (RAMIS)"
  • shaft-tip transition: A keypoint marking the junction between an instrument’s shaft and tip. "the shaft-tip transition, indicating the junction between the shaft and the tip,"
  • Surgical phase recognition: Classification of procedural phases from endoscopic video. "Surgical phase recognition: Classification of the surgical phases of a cholecystectomy as accurately as possible."
  • undefined phase: A label used for transitional frames between phases. "an eighth category -- an undefined phase -- was introduced to label transitional frames between two phases."
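As a rough illustration of how several glossary terms fit together, the sketch below matches predicted to ground-truth instances by IoU and then scores each matched pair with the DSC. The paper states that the Hungarian algorithm was used for matching; this sketch swaps in a brute-force search over assignments, which finds the same optimum but is only practical for a handful of instances. Masks are represented as sets of pixel coordinates, and all function names and toy rectangles are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-Union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Dice Similarity Coefficient of two pixel sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def match_instances(preds, gts):
    """Return (pred_idx, gt_idx) pairs maximising the summed IoU."""
    # Brute force over one-to-one assignments (stand-in for the Hungarian algorithm).
    if len(preds) <= len(gts):
        candidates = (list(zip(range(len(preds)), perm))
                      for perm in permutations(range(len(gts)), len(preds)))
    else:
        candidates = (list(zip(perm, range(len(gts))))
                      for perm in permutations(range(len(preds)), len(gts)))
    best = max(candidates,
               key=lambda pairs: sum(iou(preds[p], gts[g]) for p, g in pairs))
    # Drop pairs that do not overlap at all.
    return [(p, g) for p, g in best if iou(preds[p], gts[g]) > 0]


def rect(x0, y0, x1, y1):
    """Pixel set of an axis-aligned rectangle, end-exclusive."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Two ground-truth instruments and three predictions (assumed toy masks).
gts = [rect(0, 0, 4, 4), rect(10, 0, 14, 4)]
preds = [rect(10, 0, 14, 2), rect(0, 0, 4, 4), rect(20, 20, 22, 22)]
pairs = match_instances(preds, gts)
scores = {g: dice(preds[p], gts[g]) for p, g in pairs}
print(pairs, scores)
```

The third prediction overlaps no ground-truth instance and stays unmatched. How unmatched predictions and missed instances are penalised in the challenge metrics is not covered by this sketch.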
