- The paper presents a novel hybrid CNN-transformer architecture integrating query tokens for efficient 2D medical image segmentation.
- It employs a feature pyramid encoder, multi-level fusion module, and query token-based decoder, achieving 92.42% Dice and 86.74% IoU on ISIC2016.
- The design reduces computational complexity while capturing long-range dependencies, paving the way for resource-efficient clinical deployments.
Analysis of QTSeg: A Query Token-Based Architecture for Efficient 2D Medical Image Segmentation
The paper "QTSeg: A Query Token-Based Architecture for Efficient 2D Medical Image Segmentation" presents a novel architecture that combines the strengths of convolutional neural networks (CNNs) and transformer models to address shortcomings in existing methods for medical image segmentation. The study introduces a query token-based hybrid architecture designed to produce high-accuracy segmentations of 2D medical images while minimizing computational cost.
The Challenge of Long-Range Dependencies
The authors acknowledge the effectiveness of CNNs in achieving pixel-level precision when segmenting regions of interest in medical imagery. A recognized limitation of CNNs, however, is their restricted receptive field, which makes them ineffective at capturing long-range dependencies; transformer architectures excel here through their attention mechanisms, which improve segmentation quality by relating distant regions of the image. Yet self-attention incurs quadratic computational complexity in the number of tokens, making pure transformer models less feasible for high-resolution medical images due to the increased demand on computational resources.
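The quadratic-cost argument can be made concrete with a back-of-the-envelope FLOP count for a single self-attention layer. This is a rough sketch with illustrative patch size and embedding dimension, not the paper's own accounting:

```python
def attention_cost(height, width, patch, dim):
    """Rough FLOP count for one self-attention layer over an image.

    Token count n = (H / patch) * (W / patch); the QK^T score matrix
    is n x n, so compute and memory grow quadratically with n.
    Illustrative accounting only (projections and softmax omitted).
    """
    n = (height // patch) * (width // patch)
    qk_flops = n * n * dim   # scores = Q @ K^T
    av_flops = n * n * dim   # out = softmax(scores) @ V
    return n, qk_flops + av_flops

# Doubling the resolution quadruples the token count and
# multiplies the attention cost by roughly 16.
n1, c1 = attention_cost(256, 256, 16, 64)
n2, c2 = attention_cost(512, 512, 16, 64)
```

With a 16-pixel patch, a 256x256 image yields 256 tokens while a 512x512 image yields 1024, so the score matrix grows from 256x256 to 1024x1024 entries; this is why high-resolution scans strain pure transformer models.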
Proposed Solution: QTSeg Architecture
The proposed QTSeg architecture integrates a feature pyramid network (FPN) for image encoding, a multi-level feature fusion (MLFF) module for adaptive communication between encoder and decoder, and a multi-query mask decoder (MQM Decoder) to efficiently generate the segmentation mask.
- Encoder: By utilizing FPN architecture, QTSeg extracts multi-scale features from the input, a design motivated by its capacity to construct rich hierarchical feature representations essential for medical image processing.
- Adaptive Module: The MLFF module ensures optimal adaptation and fusion of features across the encoder's levels, thereby bridging the gap between local feature extraction (by CNNs) and global attention (by transformers).
- Decoder: The MQM Decoder employs query tokens, serving as an advanced decoding mechanism to refine segmentation masks through integrated attention across feature levels. This approach aligns with strategies found effective in the Segment Anything Model (SAM) project, where query tokens are leveraged to synthesize target outcomes from varied feature embeddings.
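The query-token decoding idea can be illustrated with a minimal single-head cross-attention step in NumPy, where a small set of learned query tokens attends over flattened image features. Shapes, names, and the random projections are illustrative assumptions, not the paper's actual MQM Decoder:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_token_decode(queries, features, w_q, w_k, w_v):
    """One cross-attention step of a query-token decoder (sketch).

    queries:  (q, d) learned query tokens
    features: (n, d) flattened multi-scale feature embeddings
    w_q/w_k/w_v: (d, d) projections (random here; learned in practice)

    Each of the q queries attends over all n feature tokens, so the
    score matrix is (q, n) -- linear in n, unlike the (n, n) matrix
    of full self-attention.
    """
    q = queries @ w_q
    k = features @ w_k
    v = features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (q, n)
    return softmax(scores, axis=-1) @ v      # (q, d) refined queries

rng = np.random.default_rng(0)
d, n_queries, n_features = 32, 4, 196
out = query_token_decode(
    rng.normal(size=(n_queries, d)),
    rng.normal(size=(n_features, d)),
    *(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)),
)
```

The key efficiency point the sketch captures is that a fixed, small number of query tokens replaces dense token-to-token attention, which is the same mechanism that keeps SAM-style decoders cheap.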
Experimental Validation and Results
The empirical evaluation showed that QTSeg outperformed existing methods across all key metrics while demanding less computation than state-of-the-art baselines. On the ISIC2016, BUSI, and BKAI-IGH NeoPolyp datasets, the architecture achieved strong Dice and IoU scores at reduced complexity, reflected in lower parameter counts and FLOPs. Notably, QTSeg reached 92.42% Dice and 86.74% IoU on ISIC2016.
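For reference, the two reported metrics are straightforward to compute for binary masks; the toy masks below are illustrative, not taken from the paper:

```python
import numpy as np

def dice_iou(pred, target, eps=1e-7):
    """Dice and IoU for binary segmentation masks.

    Dice = 2|P∩T| / (|P| + |T|);  IoU = |P∩T| / |P∪T|.
    """
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dice = 2 * inter / (pred.sum() + target.sum() + eps)
    iou = inter / (np.logical_or(pred, target).sum() + eps)
    return dice, iou

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt   = np.array([[1, 0, 0],
                 [0, 1, 1]])
d, i = dice_iou(pred, gt)
# intersection = 2, |P| = |T| = 3, union = 4
# → Dice = 4/6 ≈ 0.667, IoU = 2/4 = 0.5
```

Since Dice = 2·IoU / (1 + IoU), Dice is always at least as large as IoU, consistent with the 92.42% Dice versus 86.74% IoU reported on ISIC2016.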
Implications and Future Directions
The theoretical and practical contributions of this study are noteworthy. Theoretically, QTSeg shows how CNNs' local context capture can be harmonized with transformers' attention mechanisms, addressing the inherent weakness of CNN architectures with respect to long-range dependencies. Practically, the design paves the way for faster, resource-efficient deployments in medical imaging settings, especially where computational resources are scarce or high-resolution scans must be processed.
This work also lays a foundation for future research in AI-driven segmentation. One direction is applying QTSeg beyond medical imaging, to other domains that require precise segmentation of high-dimensional data. Another is integrating pre-trained models into QTSeg's architecture to improve scalability and adaptability, particularly when those models encode domain-specific knowledge.
The research, by addressing limitations in both existing CNN and transformer models, provides a solid groundwork for advancing the segmentation tasks in medical imaging, with far-reaching implications across AI-assisted diagnostics and treatment planning.