LSSF-Net: Lightweight Segmentation with Self-Awareness, Spatial Attention, and Focal Modulation

Published 3 Sep 2024 in cs.CV and cs.AI | (2409.01572v1)

Abstract: Accurate segmentation of skin lesions within dermoscopic images plays a crucial role in the timely identification of skin cancer for computer-aided diagnosis on mobile platforms. However, varying shapes of the lesions, lack of defined edges, and the presence of obstructions such as hair strands and marker colors make this challenge more complex. Additionally, skin lesions often exhibit subtle variations in texture and color that are difficult to differentiate from surrounding healthy skin, necessitating models that can capture both fine-grained details and broader contextual information. Currently, melanoma segmentation models are commonly based on fully connected networks and U-Nets. However, these models often struggle with capturing the complex and varied characteristics of skin lesions, such as the presence of indistinct boundaries and diverse lesion appearances, which can lead to suboptimal segmentation performance. To address these challenges, we propose a novel lightweight network specifically designed for skin lesion segmentation utilizing mobile devices, featuring a minimal number of learnable parameters (only 0.8 million). This network comprises an encoder-decoder architecture that incorporates conformer-based focal modulation attention, self-aware local and global spatial attention, and split channel-shuffle. The efficacy of our model has been evaluated on four well-established benchmark datasets for skin lesion segmentation: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. Empirical findings substantiate its state-of-the-art performance, notably reflected in a high Jaccard index.

Summary

The paper demonstrates that LSSF-Net integrates conformer-based focal modulation with self-aware and global spatial attention to achieve high segmentation accuracy using just 0.8M parameters.
The model utilizes innovative components such as CFMA, SAB, and split channel shuffle to enhance both local and global feature extraction in a resource-efficient manner.
Experimental evaluations on multiple datasets confirm LSSF-Net's robust generalization, computational efficiency, and potential for mobile deployment in computer-aided diagnosis.

The paper introduces LSSF-Net, a lightweight deep learning architecture designed for accurate skin lesion segmentation in dermoscopic images, with a focus on mobile deployment (2409.01572). The network integrates conformer-based focal modulation attention, self-aware local and global spatial attention, and split channel-shuffle to achieve state-of-the-art performance with a minimal number of parameters (0.8 million). The efficacy of LSSF-Net is validated across multiple benchmark datasets, demonstrating its potential for computer-aided diagnosis on resource-constrained devices.

Introduction to LSSF-Net

The imperative for early and accurate skin lesion detection, particularly melanoma, is well-established. LSSF-Net addresses the challenges posed by the complex characteristics of skin lesions, such as indistinct boundaries, diverse appearances, and subtle variations in texture and color. The architecture leverages an encoder-decoder structure, incorporating several novel components to enhance feature extraction and segmentation accuracy, with the goal of balancing performance and resource efficiency for mobile deployment.

Figure 1: Block diagram of the proposed LSSF-Net. "CFMA" is conformer-based focal modulation attention, "SAB" is the self-attention block, and "GSA" is global spatial attention.

Architectural Innovations

LSSF-Net incorporates several key architectural innovations:

Parallel Booster Encoder and Decoder: CNNs extract multiscale feature information, while a booster architecture models global contextual information to establish long-range dependencies.
Conformer-based Focal Modulation Attention (CFMA): Introduced as a skip connection, CFMA enhances the acquisition of detailed global and local feature information during decoding (Figure 2).
Self-Aware Attention Block (SAB) and Global Spatial Attention (GSA): These attention mechanisms refine feature information by capturing contextual relationships and enhancing local contextual information.
Split Channel Shuffle (SCS): This mechanism mixes information across feature channels, improving cross-channel information flow and model efficiency.

Figure 2: Schematic of the conformer-based focal modulation attention (CFMA). "LN" is layer normalization.
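
To make the channel-mixing idea behind SCS concrete, here is a minimal NumPy sketch of a ShuffleNet-style channel shuffle applied after a channel split. The summary does not spell out the exact SCS design, so the function names, the two-way split, and the identity branches are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups. x has shape (C, H, W)."""
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    # (C, H, W) -> (groups, C // groups, H, W) -> swap group axes -> flatten back
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def split_channel_shuffle(x, groups=2):
    """Hypothetical SCS: split channels in two halves, process, concatenate, shuffle."""
    first, second = np.split(x, 2, axis=0)  # requires an even channel count
    # In a real network each half would pass through its own branch; identity here.
    merged = np.concatenate([first, second], axis=0)
    return channel_shuffle(merged, groups)

# Tag each of 6 channels with its own index so the reordering is visible.
x = np.arange(6, dtype=float).reshape(6, 1, 1) * np.ones((6, 2, 2))
shuffled = split_channel_shuffle(x, groups=2)
print(shuffled[:, 0, 0])  # channel order after shuffling
```

With two groups, channels from the first and second halves end up interleaved (0, 3, 1, 4, 2, 5), so a later grouped or split operation sees features from both halves.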

Implementation Details

The implementation uses four encoder-decoder blocks built from convolutions, max pooling, and upsampling. The initial skip connection ($s_o$) and encoder block output ($E_o$) are computed using convolutional operations and max pooling:

$s_{o}=l^{3\times 3}(X_{in})$

$E_{o}=m_{p}\left ( l^{3\times 3}\left ( l^{3\times 3}\left (s_{o} \right ) \right ) \right )$

Subsequent encoder block outputs ($E_k$) are refined using skip connections ($s_k$) and residual learning:

$E_{k}=m_{p}\left [ \Re\left \{ \beta _{n}\left ( f^{3\times 3}\left ( \beta _{n}\left ( f^{3\times 3}\left ( s_{k} \right ) \right ) \right ) \right ) + f^{3\times 3}\left ( l^{3\times 3}\left ( l^{3\times 3} \left ( E_{k-1} \right )\right ) \right )\right \} \right ]$
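
The structure of this equation can be sketched in NumPy. The summary does not define the operators, so the following reading is an assumption: $f^{3\times 3}$ is a plain 3×3 convolution, $l^{3\times 3}$ is a 3×3 convolution followed by ReLU, $\beta_n$ is a normalization layer (replaced here by a simple per-channel normalization), $\Re$ is ReLU, and $m_p$ is 2×2 max pooling. The sketch also assumes $s_k$ and $E_{k-1}$ share the same shape; all function and weight names are hypothetical.

```python
import numpy as np

def conv3x3(x, w):
    """Same-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out

def relu(x):
    return np.maximum(x, 0.0)

def channel_norm(x, eps=1e-5):
    """Stand-in for beta_n: zero mean, unit variance per channel (assumption)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def max_pool2x2(x):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    c, h, w = x.shape
    x = x[:, :h - h % 2, :w - w % 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def encoder_block(s_k, e_prev, w):
    """E_k = m_p[ R{ b_n(f(b_n(f(s_k)))) + f(l(l(E_{k-1}))) } ] under the stated assumptions."""
    skip_path = channel_norm(conv3x3(channel_norm(conv3x3(s_k, w["f1"])), w["f2"]))
    main_path = conv3x3(relu(conv3x3(relu(conv3x3(e_prev, w["l1"])), w["l2"])), w["f3"])
    return max_pool2x2(relu(skip_path + main_path))

rng = np.random.default_rng(0)
channels = 4
weights = {name: rng.normal(0, 0.1, size=(channels, channels, 3, 3))
           for name in ("f1", "f2", "l1", "l2", "f3")}
s_k = rng.normal(size=(channels, 8, 8))
e_prev = rng.normal(size=(channels, 8, 8))
e_k = encoder_block(s_k, e_prev, weights)
print(e_k.shape)  # spatial size is halved by the pooling step
```

The two branches are summed before the activation and pooling, which is the residual pattern the equation describes.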

The decoder stage reconstructs spatial feature maps using CFMA on skip connections:

$\Im _{k} = CFMA(s_{k}) + l^{3\times 3}(u_{p}(D_{k-1}))$

The final output ($X_{out}$) is obtained through convolutional and sigmoid operations:

$X_{out}= \sigma(f^{1\times 1}(l^{3\times 3}(\Im _{k})))$
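
A minimal NumPy sketch of one decoder step and the output head follows. It rests on assumptions not confirmed by the summary: $l^{3\times 3}$ is read as a 3×3 convolution followed by ReLU, $u_p$ as nearest-neighbour 2× upsampling, and $f^{1\times 1}$ as a pointwise convolution. CFMA is passed in as a callable and defaults to an identity placeholder, which is not the actual attention module; all names are hypothetical.

```python
import numpy as np

def conv3x3(x, w):
    """Same-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out

def conv1x1(x, w):
    """Pointwise convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 (assumed form of u_p)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(s_k, d_prev, w_l, cfma=lambda s: s):
    """Im_k = CFMA(s_k) + l(u_p(D_{k-1})); cfma defaults to an identity placeholder."""
    return cfma(s_k) + relu(conv3x3(upsample2x(d_prev), w_l))

def output_head(im_k, w_l, w_f):
    """X_out = sigmoid(f_1x1(l_3x3(Im_k))), giving per-pixel lesion probabilities."""
    return sigmoid(conv1x1(relu(conv3x3(im_k, w_l)), w_f))

rng = np.random.default_rng(0)
channels = 4
s_k = rng.normal(size=(channels, 8, 8))      # skip connection at full resolution
d_prev = rng.normal(size=(channels, 4, 4))   # previous decoder output at half resolution
w_dec = rng.normal(0, 0.1, size=(channels, channels, 3, 3))
w_out3 = rng.normal(0, 0.1, size=(channels, channels, 3, 3))
w_out1 = rng.normal(0, 0.1, size=(1, channels))

im_k = decoder_step(s_k, d_prev, w_dec)
x_out = output_head(im_k, w_out3, w_out1)
mask = x_out > 0.5  # a 0.5 threshold is a common choice, assumed here
print(x_out.shape, mask.dtype)
```

Because the sigmoid maps every pixel to a value between 0 and 1, a single output channel is enough for the binary lesion-versus-background task.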

Experimental Results

LSSF-Net's performance was evaluated on ISIC 2016, ISIC 2017, ISIC 2018, PH2, BUSI, and DDTI datasets. The evaluation metrics included Jaccard index, Dice coefficient, accuracy, sensitivity, and specificity. Ablation studies on the ISIC 2017 dataset demonstrated the contribution of each component to the overall performance (Figure 3). The results showed that the combination of CFMA, SAB, and SCS-SAB, along with transfer learning, yielded the best performance.
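
For reference, the five metrics named above can be computed from the confusion counts of a binary mask. The sketch below uses the standard definitions; the function name and the toy masks are illustrative and are not taken from the paper.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Compute standard binary segmentation metrics from two masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))    # lesion pixels correctly predicted
    fp = int(np.sum(pred & ~target))   # background predicted as lesion
    fn = int(np.sum(~pred & target))   # lesion pixels that were missed
    tn = int(np.sum(~pred & ~target))  # background correctly predicted

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "jaccard": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

# Toy 2x4 masks: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
target = [[1, 0, 0, 0], [0, 1, 1, 0]]
metrics = segmentation_metrics(pred, target)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The Jaccard index is never larger than the Dice coefficient for the same prediction (here 0.5 versus about 0.667), which is worth keeping in mind when comparing numbers reported with different metrics.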

Figure 3: Visual results of ablation study on ISIC 2017 dataset. $1^{st}$ column shows the color image, $2^{nd}$ column shows the corresponding ground truth, $3^{rd}$ column shows the output of baseline network (BN), $4^{th}$ column shows the output of (BN + CFMA), $5^{th}$ column shows the output of (BN + SAB), $6^{th}$ column shows the output of (BN + CFMA + SAB), $7^{th}$ column shows the output of (BN + CFMA + SCS-SAB), and $8^{th}$ column shows the output of (BN + CFMA + SCS-SAB + Transfer Learning).

Visual comparisons on the ISIC 2018 dataset (Figure 4) further illustrate the effectiveness of LSSF-Net in accurately segmenting skin lesions, particularly in challenging scenarios with occlusions and low contrast.

Figure 4: Comparison of the visual performance of the proposed LSSF-Net on ISIC 2018 [codella2019skin] dataset.

Similar visual results on the ISIC 2017 dataset (Figure 5) confirm the superior performance of LSSF-Net in achieving segmentation results that closely align with the ground truth.

Figure 5: Comparison of the visual performance of the proposed LSSF-Net on ISIC 2017 [codella2018skin] dataset.

Results on the ISIC 2016 dataset (Figure 6) also show LSSF-Net's ability to handle lesions of diverse scales and irregular shapes while consistently producing accurate segmentation results.

Figure 6: Comparison of the visual performance of the proposed LSSF-Net on ISIC 2016 [gutman2016skin] dataset.

Additionally, the generalization capability of LSSF-Net was validated on the PH2 dataset (Figure 7), where it accurately segmented lesion regions despite the presence of hair, contrast variations, and irregular boundary shapes.

Figure 7: Visual results of the proposed LSSF-Net on the PH2 [mendoncca2013ph] dataset.

Results on the BUSI breast ultrasound (Figure 8) and DDTI thyroid ultrasound (Figure 9) datasets further show LSSF-Net's ability to deliver precise segmentation results, even for images containing targets of diverse sizes and irregular shapes.

Figure 8: Comparison of the visual performance of the proposed LSSF-Net on BUSI [BUSIdataset] dataset.

Figure 9: Comparison of the visual performance of the proposed LSSF-Net on DDTI [DDTIdataset] dataset.

Quantitative Analysis

Across the datasets, LSSF-Net consistently outperformed other state-of-the-art methods, achieving higher Jaccard indices and demonstrating robust generalization capabilities through cross-dataset evaluations. The computational complexity analysis revealed that LSSF-Net achieves superior computational efficiency with a minimal number of parameters and reduced inference time.

Limitations and Future Directions

While LSSF-Net excels in binary segmentation tasks, its lightweight architecture may limit its applicability to more complex problems involving multiple modalities and classes. Future research could focus on extending LSSF-Net to support multiclass segmentation and multiple imaging modalities, broadening its applicability to other medical imaging scenarios. Additionally, quantization techniques could further optimize LSSF-Net for deployment on resource-constrained devices, making it better suited to CAD systems and mobile devices.
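
To give a rough sense of what quantization could buy, the sketch below applies symmetric per-tensor int8 quantization to a random weight vector of about 0.8 million values and compares storage sizes. The scheme, the random weights, and the function names are assumptions for illustration; the paper does not specify a quantization method, and real toolchains typically quantize per layer or per channel.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to int8 using one scale factor."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

n_params = 800_000  # approximate parameter count reported for LSSF-Net
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.05, size=n_params).astype(np.float32)  # synthetic stand-in weights

q, scale = quantize_int8(weights)
max_error = float(np.max(np.abs(dequantize(q, scale) - weights)))

print(f"float32 storage: {weights.nbytes / 1e6:.1f} MB")
print(f"int8 storage:    {q.nbytes / 1e6:.1f} MB")
print(f"max round-trip error: {max_error:.6f} (scale = {scale:.6f})")
```

At four bytes per float32 parameter, roughly 0.8 million parameters occupy about 3.2 MB, and int8 storage cuts that to about 0.8 MB, with a rounding error of at most half the scale per weight. Whether segmentation accuracy survives such quantization would need to be measured on the actual model.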

Conclusion

LSSF-Net represents a significant advancement in skin lesion segmentation, offering a balance of accuracy and efficiency suitable for mobile deployment. The architecture's innovative components and robust performance across multiple datasets highlight its potential for real-world applications in computer-aided diagnosis of dermatological conditions. Future research directions include extending the network's capabilities to handle more complex segmentation tasks and optimizing its deployment on resource-constrained devices.
