HarDNet-MSEG is a convolutional neural network architecture designed for real-time semantic segmentation, with particular focus on medical image analysis tasks such as polyp and ulcer segmentation. The architecture integrates a low-memory-traffic encoding backbone (HarDNet68) with a lightweight decoder, originally based on a cascaded partial decoding paradigm. HarDNet-MSEG has established state-of-the-art accuracy and inference speed on multiple medical segmentation benchmarks. It has also served as the foundation for adaptations in broader biomedical segmentation challenges.
1. Encoder Architecture: HarDNet68
HarDNet68 is a DenseNet-inspired backbone that sparsifies the connection topology to achieve lower memory traffic while preserving or increasing accuracy. HarD blocks, the core computational units of HarDNet, replace dense connectivity with a sparser "harmonic" pattern: each layer within a HarD block takes as input the concatenation of outputs from the earlier layers whose indices differ from its own by a power of two (i.e., layers l−1, l−2, l−4, l−8, …), in addition to the block's initial input. Each such layer performs a 1×1 convolution for channel reduction followed by a 3×3 convolution.
Mathematically, if x_0 is the block input, the input to layer l is

x_l^in = concat(x_{l−1}, x_{l−2}, x_{l−4}, …, x_0),

where concat denotes channel-wise concatenation and each layer output x_l has a fixed growth width k.
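The powers-of-two connectivity rule can be sketched in a few lines; this follows the simplified description above (the original HarDNet paper's exact pruning rule differs slightly), with index 0 standing for the block input:

```python
def hard_block_inputs(layer_idx):
    """Indices of earlier layers concatenated as input to `layer_idx`,
    following the powers-of-two connection rule described above.
    Index 0 denotes the block's input feature map."""
    inputs = set()
    n = 0
    while (1 << n) <= layer_idx:
        inputs.add(layer_idx - (1 << n))  # layer_idx - 2**n
        n += 1
    inputs.add(0)  # the block input is always concatenated
    return sorted(inputs)

print(hard_block_inputs(6))  # -> [0, 2, 4, 5]
```

Because only O(log l) predecessors are concatenated instead of all l of them, memory traffic grows far more slowly than in a dense block.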
HarDNet68 is structured as follows (layer counts as described for ImageNet pretraining):
- Initial stem: 3×3 convolution(s), stride 2, batch normalization and ReLU, yielding 32–64 channels at 1/2 spatial resolution.
- Four sequential HarD blocks, separated by stride-2 transitions:
  - Stage 1: ~4 layers (1/4 resolution)
  - Transition: 1×1 conv, stride 2
  - Stage 2: ~8 layers (1/8 resolution)
  - Transition: 1×1 conv, stride 2
  - Stage 3: ~16 layers, up to 640 channels (1/16 resolution)
  - Transition: 1×1 conv, stride 2
  - Stage 4: ~16 layers, up to 1024 channels (1/32 resolution)
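Assuming the stem halves resolution once more before the first block (so stage 1 sits at 1/4 resolution) and each transition is stride 2, the stage resolutions can be traced with a short helper:

```python
def stage_resolutions(input_hw=512):
    """Trace spatial resolution through the stride-2 stem, an extra
    downsample before stage 1, and three stride-2 transitions between
    the four HarD blocks (a sketch of the layout described above)."""
    res = input_hw // 2          # stem: 3x3 conv, stride 2 -> 1/2
    res //= 2                    # downsample into stage 1 -> 1/4
    stages = []
    for _ in range(4):
        stages.append(res)
        res //= 2                # 1x1 transition conv, stride 2
    return stages

print(stage_resolutions(512))  # -> [128, 64, 32, 16]
```

For a 512×512 input this yields 128, 64, 32, and 16 pixel feature maps for stages 1 through 4, i.e., 1/4 through 1/32 resolution.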
This design balances computational efficiency with representational capacity, supporting high inference throughput (Huang et al., 2021).
2. Decoder Architecture: Cascaded Partial Decoder and Variants
The original HarDNet-MSEG decoder is adapted from the Cascaded Partial Decoder (CPD) architecture, optimized for efficiency on salient object segmentation tasks. Unlike classical U-Net designs that fuse all encoder stages symmetrically, the CPD approach processes only the three deepest feature levels (stages 2, 3, 4), discarding the shallowest high-resolution feature maps to reduce compute.
Each of these feature maps is passed through a Receptive Field Block (RFB):
- Parallel 3×3 convolutions with dilation rates {1, 3, 5}
- A 1×1 convolution branch
- Concatenated outputs are compressed by a final 1×1 convolution
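The dilation rates translate into growing effective receptive fields: for a k×k convolution with dilation d, the effective field is (k−1)·d + 1, which can be checked directly:

```python
def rfb_branch_receptive_fields(dilations=(1, 3, 5), k=3):
    """Effective receptive field of a kxk convolution with dilation d:
    (k - 1) * d + 1. The RFB's parallel branches thus see contexts of
    increasing size from the same input."""
    return [(k - 1) * d + 1 for d in dilations]

print(rfb_branch_receptive_fields())  # -> [3, 7, 11]
```

Concatenating branches with 3-, 7-, and 11-pixel fields lets a single block mix local and wider context cheaply, which is the point of the RFB design.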
Decoder fusion proceeds as follows: the deepest encoder feature (stage 4, at 1/32 resolution) is upsampled and multiplied element-wise with the RFB-processed next-shallower map, and this multiply-and-upsample step is repeated iteratively up the hierarchy (stage 3 at 1/16, stage 2 at 1/8). Only the stage 2 and stage 3 features are reintroduced via skip connections.
A 1×1 convolution and a final upsampling restore full-resolution segmentation logits.
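The multiply-and-upsample fusion can be sketched on toy 2D maps in plain Python; nearest-neighbor upsampling stands in for the decoder's learned upsampling, and the three inputs play the roles of the stage 4, 3, and 2 features:

```python
def upsample2x(fm):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def cascade_fuse(deep, mid, shallow):
    """Sketch of the CPD-style fusion described above: the deepest map is
    upsampled and multiplied element-wise with the next-shallower map,
    then the result is upsampled and multiplied with the shallowest map."""
    x = upsample2x(deep)
    x = [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, mid)]
    x = upsample2x(x)
    x = [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, shallow)]
    return x

deep = [[2]]                           # 1x1 "stage 4" map
mid = [[1, 2], [3, 4]]                 # 2x2 "stage 3" map
shallow = [[1] * 4 for _ in range(4)]  # 4x4 "stage 2" map
print(cascade_fuse(deep, mid, shallow))
```

The element-wise product acts as a gating mechanism: deep, semantically strong responses suppress shallow activations that the coarse map considers background.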
In the context of the Diabetic Foot Ulcer Challenge 2022, a top-performing adaptation replaced the CPD+RFB decoder with a Lawin Transformer block for multi-scale context aggregation and re-routed skip connections to better capture mid-level features, demonstrating architectural flexibility (Kendrick et al., 2023).
3. Training Protocols and Loss Functions
HarDNet-MSEG’s training methodology varies to align with dataset splits inherited from prior benchmarks:
- Kvasir-SEG "Jha et al. (2020, real-time)" split: 880 training, 120 test images. Input size: 512×512. Optimizer: SGD; 100 epochs. Data augmentation: random rotation, horizontal and vertical flips.
- PraNet split: 1,450 training images and five benchmark test sets. Input size: 352×352. Optimizer: Adam; 100 epochs; no augmentation specified.
- Batch sizes are not reported.
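The flip portion of the Kvasir-SEG augmentation can be sketched in pure Python (rotation omitted for brevity; `augment` is an illustrative helper, not code from the paper, and treats an image as a 2D list):

```python
import random

def augment(img, rng):
    """Randomly apply horizontal and/or vertical flips, each with
    probability 0.5, as in the Kvasir-SEG training protocol above."""
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]  # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1]                   # vertical flip
    return img

img = [[1, 2], [3, 4]]
out = augment(img, random.Random(0))
print(len(out), len(out[0]))  # spatial shape is preserved
```

In practice the same random decision must be applied to the image and its segmentation mask so that the pair stays aligned.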
Performance is measured by mean Dice (mDice), mean Intersection over Union (mIoU), Precision, Recall, and F2-score. In terms of pixel counts (TP, FP, FN):
- mDice = 2·TP / (2·TP + FP + FN)
- mIoU = TP / (TP + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F2 = 5 · Precision · Recall / (4 · Precision + Recall)
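These metrics can be computed directly from confusion counts; a minimal sketch (per-image averaging conventions vary across papers, so this is the single-image form):

```python
def segmentation_metrics(tp, fp, fn):
    """Standard overlap metrics from pixel counts for one image."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    f2 = 5 * precision * recall / (4 * precision + recall)
    return {"dice": dice, "iou": iou, "precision": precision,
            "recall": recall, "f2": f2}

m = segmentation_metrics(tp=80, fp=10, fn=10)
print(m)  # dice = 160/180 ~ 0.889, iou = 80/100 = 0.8
```

Note that Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), which is why the two columns in the benchmark table always move together.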
The segmentation loss function is not explicitly stated; standard practice for this task (a weighted sum of pixelwise cross-entropy and Dice losses) likely applies, but the HarDNet-MSEG paper does not provide a formula.
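As a concrete illustration of that standard practice (an assumption, not the paper's documented loss), a weighted BCE + Dice loss over flattened foreground probabilities might look like:

```python
import math

def bce_dice_loss(pred, target, w_bce=0.5, w_dice=0.5, eps=1e-7):
    """Hypothetical combined loss; the HarDNet-MSEG paper does not
    state its loss. pred: predicted foreground probabilities in [0,1],
    target: binary ground-truth labels, both flattened to 1D."""
    # pixelwise binary cross-entropy, averaged over pixels
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(pred, target)) / len(pred)
    # soft Dice on the same probabilities
    inter = sum(p * t for p, t in zip(pred, target))
    dice = (2 * inter + eps) / (sum(pred) + sum(target) + eps)
    return w_bce * bce + w_dice * (1 - dice)

loss = bce_dice_loss([0.9, 0.1, 0.8], [1, 0, 1])
print(round(loss, 4))
```

The BCE term drives per-pixel calibration while the Dice term directly optimizes region overlap, which helps when foreground pixels (the polyp) are a small fraction of the image.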
4. Performance Analysis Across Benchmarks
HarDNet-MSEG delivers high segmentation accuracy and real-time inference speeds, as illustrated below:
| Dataset (Split, Input) | mDice | mIoU | Precision | Recall | F2 | Accuracy | FPS |
|---|---|---|---|---|---|---|---|
| Kvasir-SEG (880/120, 512×512) | 0.904 | 0.848 | 0.907 | 0.923 | 0.915 | 0.969 | 86.7 |
| Kvasir-SEG (PraNet, 352×352) | 0.912 | 0.857 | - | - | - | - | 88 |
| CVC-ClinicDB (PraNet, 352×352) | 0.932 | 0.882 | - | - | - | - | 88 |
| CVC-ColonDB (PraNet, 352×352) | 0.731 | 0.660 | - | - | - | - | 88 |
| ETIS-Larib (PraNet, 352×352) | 0.677 | 0.613 | - | - | - | - | 88 |
| EndoScene (CVC-T) (PraNet, 352×352) | 0.887 | 0.821 | - | - | - | - | 88 |
On Kvasir-SEG, HarDNet-MSEG surpasses U-Net and ResNet34-backboned variants in accuracy and exceeds PraNet in both accuracy and speed, establishing new accuracy and throughput standards in this domain (Huang et al., 2021).
In the DFUC 2022 Challenge, a HarDNet-MSEG adaptation achieved a Dice of 0.7287 and Jaccard 0.6252, outperforming fully convolutional baselines by a 0.1579 Dice margin (Kendrick et al., 2023).
5. Design Adaptations and Extensions
In competitive settings, HarDNet-MSEG’s modular encoder-decoder framework has supported further optimization:
- The Lawin Transformer decoder, as integrated in DFUC 2022, utilizes large-window cross-attention for enhanced multi-scale context modeling in lesion segmentation tasks (Kendrick et al., 2023).
- Feature fusion strategies have been modified by shifting skip connections towards more semantically meaningful intermediate encoder stages, thereby enhancing representation of spatially-varying biomedical features.
- Post-processing, such as morphological hole-filling and small component elimination, is routinely applied to output masks to enhance clinical interpretability.
These adaptations demonstrate architectural flexibility and adaptability to task-specific requirements while retaining the efficiency advantages of the original backbone.
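The small-component elimination step mentioned above can be sketched in pure Python (hole-filling is omitted; real pipelines typically use a morphology library instead of this hand-rolled flood fill):

```python
from collections import deque

def remove_small_components(mask, min_size):
    """Drop connected foreground components smaller than `min_size`
    pixels (4-connectivity), a simple stand-in for the post-processing
    applied to predicted segmentation masks."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                comp, q = [], deque([(i, j)])
                seen[i][j] = True
                while q:  # BFS flood fill to collect one component
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_size:  # keep only large components
                    for y, x in comp:
                        out[y][x] = 1
    return out

mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
print(remove_small_components(mask, min_size=2))
```

Here the 4-pixel blob survives while the isolated single pixel is discarded, mirroring how spurious speckle predictions are cleaned from clinical masks.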
6. Context and Impact
HarDNet-MSEG is part of a lineage of efficient semantic segmentation networks, inheriting dense connectivity concepts while prioritizing memory- and compute-efficiency. The encoder's sparse but strategic connectivity pattern and efficient decoder design trade minimal accuracy for substantial speed increases, enabling real-time deployment on high-resolution medical imagery. Its state-of-the-art performance on medical imaging challenges has established HarDNet-MSEG as a reference backbone for segmentation in endoscopy and wound analysis (Huang et al., 2021, Kendrick et al., 2023).
The architecture’s impact extends to its role as a high-performance baseline in clinical competitions, inspiring further research into hybrid transformer-convolutional decoders and data-centric adaptation protocols. The reported results confirm its utility across a spectrum of biomedical segmentation regimes.
7. Limitations and Open Parameters
Certain implementation details are dataset- or competition-specific and lack full transparency in published sources. Neither batch size nor segmentation loss terms are always specified. In high-level challenge reports (e.g., DFUC 2022), modifications to feature fusion and decoder design are not published in full architectural detail, and learning rate schedules and augmentation protocols are often omitted (Kendrick et al., 2023). A plausible implication is that real-world reproducibility may require consulting unofficial code releases or additional supplementary materials.
No ablation studies isolating the contributions of decoder variants, skip connection rerouting, or channel balancing modifications are reported in summary papers to date, leaving some effects unquantified.
References:
- "HarDNet-MSEG: A Simple Encoder-Decoder Polyp Segmentation Neural Network that Achieves over 0.9 Mean Dice and 86 FPS" (Huang et al., 2021).
- "Diabetic Foot Ulcer Grand Challenge 2022 Summary" (Kendrick et al., 2023).