EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery

Published 7 Jun 2025 in cs.CV and cs.AI | (2506.06830v1)

Abstract: Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery, offering significant advantages in early disease detection and precise interventions. However, the complexity of surgical scenes, characterized by high variability in different surgical activity scenarios and confused image features between targets and the background, presents challenges for surgical environment understanding. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. To address this limitation, we explore multi-task learning, which utilizes the interrelated features between tasks to enhance overall task performance. In this paper, we propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopy surgery activity recognition and semantic segmentation. Built upon the DINOv2 foundation model, our approach integrates Low-Rank Adaptation to facilitate efficient fine-tuning while incorporating Task Efficient Shared Low-Rank Adapters to mitigate gradient conflicts across diverse tasks. Additionally, we introduce the Spatially-Aware Multi-Scale Attention that enhances feature representation discrimination by enabling cross-spatial learning of global information. In order to evaluate the effectiveness of our framework, we present three novel datasets, MTLESD, MTLEndovis and MTLEndovis-Gen, tailored for endoscopic surgery scenarios with detailed annotations for both activity recognition and semantic segmentation tasks. Extensive experiments demonstrate that EndoARSS achieves remarkable performance across multiple benchmarks, significantly improving both accuracy and robustness in comparison to existing models. These results underscore the potential of EndoARSS to advance AI-driven endoscopic surgical systems, offering valuable insights for enhancing surgical safety and efficiency.


Summary

The paper introduces a multi-task framework that adapts a DINOv2 foundation model with LoRA for efficient activity recognition and semantic segmentation in endoscopic surgery.
It employs Task Efficient Shared Low-rank Adapters to mitigate gradient conflicts and enhance performance across diverse surgical tasks.
Spatially-Aware Multi-Scale Attention improves feature discrimination, leading to significant accuracy gains on specialized multi-task endoscopic datasets.

Overview of "EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery"

The paper introduces EndoARSS, a multi-task learning framework for activity recognition and semantic segmentation in endoscopic surgery. Surgical scenes vary widely across activities, and targets are often hard to distinguish from the background. As a result, traditional deep learning models can suffer from cross-activity interference and perform suboptimally on each task. EndoARSS addresses these limitations through coordinated changes to architecture, adaptation technique, and data, with the aim of improving surgical scene understanding and easing integration into robotic surgery systems.

Technical Advancements in EndoARSS Framework

Foundation Model Utilization and Adaptation: EndoARSS is built on the DINOv2 foundation model, which is known for visual feature extraction that transfers across many downstream tasks. To tailor this large-scale model to endoscopic surgery, the paper uses Low-Rank Adaptation (LoRA). The foundation model is frozen and only the added LoRA layers are trained, so the model adapts to the domain without the computational cost of retraining the full backbone.
Task Efficient Shared Low-rank Adapters (TESLA): TESLA mitigates the gradient conflicts that arise from data heterogeneity in multi-task learning by introducing task-specific adapters adjacent to the shared foundation model backbone. This isolates task-specific parameter spaces and reduces interference between the activity recognition and segmentation tasks during training, which improves overall performance.
Spatially-Aware Multi-Scale Attention (SMA): Addressing the challenge of feature ambiguity in complex surgical environments, SMA encodes both local and global spatial information within the model. This cross-spatial learning mechanism enables EndoARSS to generate discriminative features, essential for precise surgical scene understanding. SMA's multi-scale approach enhances discrimination among similar-looking regions, a frequent occurrence in endoscopic imagery due to tissue and instrument visual similarity.
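
To make the adaptation step above more concrete, the following is a minimal sketch of a LoRA-style linear layer in plain Python. It assumes the standard LoRA formulation, in which a frozen weight W is augmented by a trainable low-rank product B·A scaled by alpha/r. The names `LoRALinear` and `matmul`, the initialization constants, and the toy shapes are hypothetical illustrations. This is not the paper's code, and it does not reproduce DINOv2, TESLA, or SMA.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update (alpha / r) * B @ A.

    Hypothetical sketch: only A and B would be trained; W stays fixed.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen backbone weight, never updated
        # A gets a small random init, B starts at zero, so the update is
        # initially zero and the layer behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def effective_weight(self):
        """Return W + scaling * (B @ A) as a new d_out x d_in matrix."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a batch x (batch x d_in) -> (batch x d_out)."""
        w_t = [list(col) for col in zip(*self.effective_weight())]  # W^T
        return matmul(x, w_t)

    def num_trainable(self):
        """Count adapter parameters: r * d_in for A plus d_out * r for B."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


# Toy usage: a 2 x 3 frozen weight adapted with a rank-1 update.
frozen = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
layer = LoRALinear(frozen, rank=1)
print(layer.forward([[1.0, 2.0, 3.0]]))  # same as the frozen layer, since B is zero
print(layer.num_trainable(), "trainable vs", 2 * 3, "frozen parameters")
```

In this toy case the adapter has almost as many parameters as the frozen weight. The savings only appear at realistic sizes, where r is much smaller than d_in and d_out. One way to picture task-specific adapters, offered purely as an assumption, is to keep one such A/B pair per task while every task shares the same frozen `weight`.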

Evaluation and Results

EndoARSS was evaluated on three new multi-task datasets curated for endoscopic surgery: MTLESD, MTLEndovis, and MTLEndovis-Gen. Each dataset carries annotations for both activity recognition and semantic segmentation, so both tasks can be evaluated on the same data. The reported results show improvements over existing models in accuracy and robustness. Classification accuracy rises for surgical activity recognition, and segmentation quality improves as measured by mean Intersection over Union (mIoU) and Dice Similarity (DS).
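
For readers unfamiliar with the segmentation metrics named above, here is a small sketch of how per-class IoU and Dice are commonly computed and averaged. The function names and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration. The paper's exact averaging protocol is not stated in this summary.

```python
def iou_and_dice(pred, target, cls):
    """Return (IoU, Dice) for one class, or None if the class never appears.

    pred and target are flat lists of integer class ids of equal length
    (a 2D mask can be flattened row by row before calling this).
    """
    tp = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, target) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, target) if p != cls and t == cls)
    if tp + fp + fn == 0:
        return None  # class absent from both maps; skipped by assumption
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


def mean_iou_and_dice(pred, target, num_classes):
    """Average IoU and Dice over all classes present in pred or target."""
    scores = []
    for cls in range(num_classes):
        result = iou_and_dice(pred, target, cls)
        if result is not None:
            scores.append(result)
    if not scores:
        raise ValueError("no class is present in either label map")
    miou = sum(s[0] for s in scores) / len(scores)
    mdice = sum(s[1] for s in scores) / len(scores)
    return miou, mdice


# Toy usage: four pixels, two classes (0 = background, 1 = instrument).
prediction = [0, 0, 1, 1]
ground_truth = [0, 1, 1, 1]
print(mean_iou_and_dice(prediction, ground_truth, num_classes=2))
```

For a single class the two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. That is why reported Dice scores typically look higher than mIoU on the same predictions.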

Implications and Future Directions

EndoARSS illustrates how parameter-efficient adaptation can tailor a foundation model to a domain-specific problem. Its development and evaluation offer insights toward safer and more efficient deployment of AI-driven systems in surgery. The paper points to further work on refining adaptation strategies and architectures. The framework could also be extended to other surgical modalities, and additional tasks could be integrated to support more comprehensive automated surgical analysis.

Future work may explore lightweight architectures that reduce computational cost and support real-time use in clinical settings. Larger and more diverse multi-task datasets would also allow models to be developed and evaluated across a wider range of surgical procedures and environments. In this sense, the paper provides a starting point for combining domain-specific surgical challenges with robust AI methods in robotic-assisted surgery.