CholecInstanceSeg: Surgical Tool Segmentation

Updated 4 February 2026
  • CholecInstanceSeg is a comprehensive dataset with 41,933 annotated frames from 85 human procedures, addressing limits of earlier segmentation resources.
  • It utilizes advanced annotation protocols including human-in-the-loop verification and detailed quality control for reliable instance-level segmentation.
  • The dataset underpins research in real-time tool tracking, workflow automation, and augmented reality, with benchmarks from Mask R-CNN and Mask2Former.

CholecInstanceSeg is a large-scale, open-access tool instance segmentation dataset tailored for laparoscopic cholecystectomy. It was established to address critical gaps in previous surgical instrument segmentation resources, specifically by providing comprehensive, instance-level segmentation masks for instruments in real human surgical procedures. The dataset underpins research in real-time tool tracking, workflow automation, and scene understanding for advanced computer-assisted interventions, and serves as a foundation for both algorithm benchmarking and higher-level surgical action grounding tasks (Alabi et al., 2024, Alabi et al., 1 Nov 2025).

1. Motivation and Context

Precise instance segmentation of surgical tools forms a core prerequisite for numerous downstream clinical computing tasks, including real-time instrument tracking, augmented reality (AR) guidance, subtask automation, and safety systems in minimally invasive or robotic surgery. Existing public datasets prior to CholecInstanceSeg were limited by their scale (generally a few thousand mask annotations), their reliance on ex vivo or porcine models rather than human in-vivo cholecystectomy, and their semantic-only mask labels that lacked consistent instance identification or class diversity.

CholecInstanceSeg addresses these limitations by:

  • Annotating 41,933 frames from 85 unique human procedures, yielding 64,483 individually identified tool instances with semantic masks and unique instance IDs.
  • Covering seven common laparoscopic tool types across diverse clinical scenarios.
  • Providing detailed quality control and agreement statistics to ensure annotation reliability (Alabi et al., 2024).

2. Dataset Construction and Annotation Protocol

The dataset is compiled from multiple sources, each rigorously annotated and harmonized for tool instance tracking:

  • Sources and Partitioning:
    • CholecSeg8k: 17 sequences, 8,080 frames, 2 tool classes.
    • CholecT50-full: 15 sequences, ~28,317 frames, 7 tool classes.
    • CholecT50-sparse: 35 sequences, 2,681 sparsely sampled frames.
    • Cholec80-sparse: 28 sequences, 2,855 frames sampled at 1/30 fps.
    • Total: 85 human surgical videos totaling 41,933 annotated frames (Alabi et al., 2024).
  • Tool Taxonomy:
  1. Grasper
  2. Bipolar
  3. Hook
  4. Clipper
  5. Scissors
  6. Irrigator
  7. Snare
  • Annotation Protocol:
    • Instrument boundaries are traced as polygons; interior holes are subsumed by the mask.
    • Instrument ports are labeled only if a shaft extends from them.
    • Annotation training guidelines address motion blur, smoke, occlusion, poor lighting, specular reflection, and unfinished tools at image edges.
    • For CholecSeg8k, semantic masks are post-processed through class-aware connected component analysis and custom pipelines to correct occlusion splits, merges, and missed detections (892 frames adjusted).
    • On CholecT50-full, a human-in-the-loop process using RTMDet-Ins accelerates annotation. Model-generated masks are iteratively corrected over two rounds, enabling scalable high-quality labeling for ≈30,000 frames.
    • Rigorous quality control is performed, including manual review, secondary annotator checks (on a subset), corroboration with tool presence labels, and targeted audit of frames with high overlap or high instance count (Alabi et al., 2024).
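The class-aware connected-component step used to split CholecSeg8k's semantic masks into instances can be sketched with a simple 4-connected labeling pass. This is an illustrative stand-in, not the dataset's released pipeline, which additionally handles occlusion splits, merges, and manual correction; the function name and mask representation (nested lists of 0/1) are assumptions for the example.

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected components of a binary mask (lists of 0/1 rows).

    Returns (labels, n): a grid of instance IDs (0 = background,
    1..n = instances) and the number of instances found. Running this
    per semantic class is one way to derive instance masks from
    semantic-only annotations; the dataset's actual post-processing
    is more involved (occlusion-aware corrections, manual review).
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    next_id = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not labels[y][x]:
                next_id += 1
                labels[y][x] = next_id
                queue = deque([(y, x)])
                while queue:  # breadth-first flood fill
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_id
                            queue.append((ny, nx))
    return labels, next_id
```

In practice a library routine such as `scipy.ndimage.label` would replace the hand-rolled flood fill; the pure-Python version above only serves to make the per-class instance-splitting idea concrete.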

3. Dataset Characteristics and Organization

CholecInstanceSeg delivers extensive metadata and conversion support for interoperability:

  • Image Properties:
    • Most frames at 854×480 pixels; three sequences at 1920×1080 pixels (resize scripts provided).
  • Directory Structure:
    • Separated into train (55 sequences, 26,830 frames), validation (17 sequences, 3,804 frames), and withheld test (23 sequences, 11,299 frames) splits.
    • Each video’s folder contains img_dir/ (PNG images) and ann_dir/ (LabelMe-format JSON with polygons, class name, coordinates, and instance group ID).
    • Scripts convert native format to COCO-format JSON including run-length encoding (RLE) for masks.
  • Class and Instance Distributions:
    • "Grasper" is most frequent; "Snare" the rarest.
    • 12.7% of frames (~5,328) have no tools; most frames contain 1–2 visible instances, with up to 4.
| Tool Class | Relative Frequency | Comments |
|---|---|---|
| Grasper | Most frequent | Multiple per frame possible |
| Snare | Rarest | Sparse occurrence |
  • Label Integrity:
    • Inter-annotator Panoptic Quality (PQ): 91.2.
    • Manual vs semi-automatic PQ: 95.7.
    • Flagged and corrected frames ensure class presence and mask integrity.
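The COCO conversion mentioned above encodes each instance mask as run-length encoding. A minimal sketch of uncompressed COCO-style RLE (runs counted in column-major order, starting with the zero run) is shown below; the function name is an assumption, and the dataset's released scripts produce compressed RLE via pycocotools rather than this plain form.

```python
def mask_to_rle(mask):
    """Encode a binary mask (lists of 0/1 rows) as uncompressed COCO-style RLE.

    COCO counts runs in column-major (Fortran) order, with the first
    count always giving the number of leading zeros (0 if the mask
    starts with a foreground pixel).
    """
    h, w = len(mask), len(mask[0])
    # Flatten in column-major order, as COCO RLE requires.
    flat = [mask[y][x] for x in range(w) for y in range(h)]
    counts, prev, run = [], 0, 0
    for v in flat:
        if v == prev:
            run += 1
        else:
            counts.append(run)  # close the previous run
            prev, run = v, 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}
```

For real masks one would call `pycocotools.mask.encode` on a Fortran-ordered uint8 array instead; the version here just makes the encoding convention explicit.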

4. Benchmarking, Baselines, and Evaluation Metrics

Standardized evaluation and strong baseline comparisons are integral to the dataset’s utility:

  • Benchmark Models:
    • Mask R-CNN and Mask2Former, trained from their official implementations.
  • Task:
    • Instance segmentation for surgical tools.
  • Data Processing:
    • Train/validation/test splits as above with augmentations from official model repositories.
  • Performance Metrics:
    • Intersection over Union (IoU): IoU(P, G) = |P ∩ G| / |P ∪ G|.
    • Average Precision (AP) at IoU threshold τ:

      AP(τ) = ∫₀¹ p(r) dr,

      with p(r) the precision at recall r.
    • Mean Average Precision (mAP) over K classes and multiple IoU thresholds:

      mAP = (1/K) Σⱼ₌₁..ₖ APⱼ.

    • COCO Sequence mAP (smAP): mAP computed per video sequence, penalizing poor cross-video generalization.
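The metrics above reduce to a few lines of arithmetic. The sketch below, with assumed function names and masks represented as sets of pixel coordinates, computes IoU, a discrete approximation of AP as the area under the precision-recall curve, and mAP as the unweighted class mean; it is illustrative, not the COCO evaluator used for the reported benchmarks.

```python
def mask_iou(p, g):
    """IoU(P, G) = |P ∩ G| / |P ∪ G| for masks given as sets of pixels."""
    p, g = set(p), set(g)
    union = len(p | g)
    return len(p & g) / union if union else 0.0

def average_precision(recalls, precisions):
    """AP(τ) ≈ Σ (r_i − r_{i−1}) · p_i, a discrete form of ∫₀¹ p(r) dr.

    Assumes recalls are sorted ascending with matching precisions.
    """
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_ap(ap_per_class):
    """mAP: unweighted mean of per-class APs (here over the 7 tool classes)."""
    return sum(ap_per_class) / len(ap_per_class)
```

The official COCO evaluation additionally averages AP over IoU thresholds 0.50:0.95 and applies interpolation; those details are omitted here for clarity.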

  • Results on validation split:

| Model | mAP | smAP |
|---|---|---|
| Mask R-CNN | 0.470 | 0.437 |
| Mask2Former | 0.682 | 0.655 |

Mask2Former demonstrates substantial improvements, particularly in handling complex boundary and overlap scenarios for rare classes and under adverse visual conditions (Alabi et al., 2024).

5. Integration into Triplet Segmentation and Scene Understanding

CholecInstanceSeg provides the foundation for more advanced, spatially grounded scene understanding tasks. In the triplet segmentation paradigm (Alabi et al., 1 Nov 2025), each detected instrument instance is linked not only to class and mask, but also to an action verb and anatomical target, thus yielding structured outputs of the form ⟨instrument, verb, target⟩ per spatial instance.
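A spatially grounded triplet can be represented as a small record tying the instrument mask to its verb and target labels. The field names below are illustrative assumptions, not CholecTriplet-Seg's actual serialization format.

```python
from dataclasses import dataclass

@dataclass
class GroundedTriplet:
    """One spatially grounded ⟨instrument, verb, target⟩ instance.

    Field names are hypothetical; the released dataset's JSON schema
    may differ. The instance_id links the triplet back to its
    CholecInstanceSeg mask, and rle_mask holds a COCO-style RLE dict.
    """
    instrument: str   # one of the 7 tool classes, e.g. "clipper"
    verb: str         # action label, e.g. "clip"
    target: str       # anatomical target, e.g. "cystic_duct"
    instance_id: int  # matches the CholecInstanceSeg instance group ID
    rle_mask: dict    # {"size": [h, w], "counts": [...]}
```

Matching such records to CholecT50's frame-level triplet labels by timestamp, with manual verification for ambiguous multi-tool frames, is how the joint dataset described below was assembled.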

  • Joint Dataset (CholecTriplet-Seg):

    • 30,955 frames from 50 videos, annotated with 49,866 spatially grounded triplets.
    • Instances from CholecInstanceSeg are matched to triplet labels from CholecT50 using timestamp and manual verification.
  • Architecture (TargetFusionNet):
    • Extends Mask2Former with a target-aware fusion module.
    • Incorporates weak anatomical priors (EndoViT logits for coarse tissues) into the transformer decoder through gated cross-attention.
    • Supports single-head classification (over 100 triplet combinations) and decomposed multi-head variants.
  • Training and Losses:
    • Hungarian bipartite matching optimizes assignment between predicted queries and ground-truth masks, incorporating mask and class-specific loss terms.
    • Overall training uses AdamW optimizer, mixed-precision, and specific augmentation strategies.
    • mAP is reported for instrument-only, verb-only, target-only, and full triplet outputs.
  • Results:
    • TargetFusionNet achieves mAP_I = 67.19 and mAP_IVT = 13.47, exceeding Mask2Former-Triplet by 1.95 and 1.24 points, respectively.
    • Qualitative gains include improved decomposition of multi-tool overlap and enhanced recall on rare anatomical structures and actions (Alabi et al., 1 Nov 2025).
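The Hungarian bipartite matching step in training can be illustrated with a brute-force minimum-cost assignment. This toy version (assumed function name, square cost matrix with one query per ground-truth mask) enumerates permutations, which is only viable for tiny inputs; real Mask2Former-style training would use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`, on a cost combining classification and mask loss terms.

```python
from itertools import permutations

def match_queries(cost):
    """Minimum-cost one-to-one assignment of predicted queries to GT masks.

    cost[i][j] is the combined class + mask matching cost between
    query i and ground-truth mask j. Returns (assignment, total_cost),
    where assignment[i] is the GT index matched to query i. Brute
    force over permutations: a stand-in for the Hungarian algorithm,
    exponential in the number of queries, so toy-sized inputs only.
    """
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return list(best_perm), best
```

With the assignment fixed, the mask and class losses are then computed only between matched query/ground-truth pairs, which is what makes the set-prediction training permutation-invariant.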

6. Potential Applications and Extensions

The dataset and associated methodologies enable the following clinical and research advances:

  • Real-time tool tracking for collision avoidance and retraction safety in surgery.
  • Augmented reality overlays for tool labeling, trajectory display, and region-of-interest demarcation.
  • Automation of surgical subtasks requiring instrument instance awareness.
  • Research into temporal tracking and motion analysis via extension to multi-frame sequence labels.
  • Expansion to additional instrument and object types, keypoint localization, and 3D pose/depth annotation.
  • Potential for integration with multi-modal data sources such as endoscopic ultrasound, force measurements, and synchronized tissue interaction records (Alabi et al., 2024).

7. Implementation, Reproducibility, and Community Resources

  • Canonical codebases: PyTorch 1.12, MMDetection 2.22, with pretrained ResNet-50 and EndoViT weights.
  • Training is reproducible with single NVIDIA A100, following the released hyperparameters and augmentation routines.
  • The full TargetFusionNet implementation and dataset access details are available at https://github.com/labdeeman7/target_fusion_net (Alabi et al., 1 Nov 2025).

CholecInstanceSeg thus serves as a robust, large-scale, and meticulously curated resource for advancing instrument instance segmentation in both clinical and algorithmic research settings. Its integration with spatial action grounding frameworks further cements its relevance to next-generation computational surgery.
