HarDNet-MSEG: Efficient Biomedical Segmentation
- The paper presents a high-speed segmentation CNN, HarDNet-MSEG, achieving over 0.9 mean Dice on Kvasir-SEG at more than 80 FPS on modern GPUs.
- Its encoder employs a low-memory HarDNet68 with selective, sparsified connectivity, while its cascaded decoder with RFB modules enhances segmentation accuracy efficiently.
- HarDNet-MSEG has been adapted for diabetic foot ulcer segmentation using a Lawin Transformer decoder, establishing a robust performance/cost trade-off in biomedical imaging.
HarDNet-MSEG is a convolutional neural network architecture developed for high-accuracy, real-time biomedical image segmentation, targeting tasks such as polyp detection in colonoscopy images. Its design is characterized by an encoder-decoder layout, employing a HarDNet68 backbone to minimize memory traffic and maximize speed, and a lightweight, cascade-style decoder optimized for both efficiency and segmentation accuracy. HarDNet-MSEG achieves state-of-the-art performance across multiple medical image benchmarks, notably exceeding 0.9 mean Dice on Kvasir-SEG at over 80 FPS on contemporary GPUs, and has also been adapted for related challenges such as diabetic foot ulcer segmentation (Huang et al., 2021, Kendrick et al., 2023).
1. Encoder Architecture: HarDNet68 Backbone
The encoder in HarDNet-MSEG is based on HarDNet68, a "low-memory-traffic" convolutional network derived from DenseNet. HarDNet introduces HarD (harmonic dense) blocks, a sparsified connectivity pattern in which each convolutional layer connects to only a subset of its predecessors. Specifically, layer $\ell$ receives input from layer $\ell - 2^n$ for every non-negative integer $n$ such that $2^n$ divides $\ell$: every layer sees layer $\ell-1$, even-indexed layers also see $\ell-2$, multiples of four see $\ell-4$, and so on, with powers of two reaching back to the block input.
Each such concatenated input undergoes a 1×1 convolution to a fixed growth width $k$, followed by a 3×3 convolution (stride 1, padding 1). This structure reduces memory traffic relative to DenseNet's full dense connectivity, while selectively widening key layers to maintain accuracy (Huang et al., 2021).
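The connectivity rule above can be sketched in a few lines. This is an illustrative reconstruction of the harmonic pattern, not the reference implementation; `hard_links` is a hypothetical helper name.

```python
# Sketch: compute the HarD-block connectivity pattern described above.
# Layer l receives input from layer l - 2**n whenever 2**n divides l
# (layer 0 denotes the block input).

def hard_links(layer: int) -> list[int]:
    """Return indices of the layers feeding `layer` under the harmonic rule."""
    links = []
    n = 0
    while 2**n <= layer:
        if layer % (2**n) == 0:
            links.append(layer - 2**n)
        n += 1
    return sorted(links)

# Layer 6 is divisible by 1 and 2 (not 4), so it reads layers 5 and 4.
print(hard_links(6))  # -> [4, 5]
# Layer 8 is divisible by 1, 2, 4, and 8, so it reads layers 7, 6, 4, and 0.
print(hard_links(8))  # -> [0, 4, 6, 7]
```

Note how the fan-in grows only logarithmically with depth, which is the source of the memory-traffic savings over DenseNet's all-to-all concatenation.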
The encoder follows a four-stage structure:
- Conv-stem: 3×3 convolutions with stride 2, each followed by batch normalization and ReLU, producing 32–64 channels at 1/4 input resolution.
- Stages 1–4: Each stage consists of a HarD block and a transition down (1×1 convolution + stride 2), reducing spatial resolution successively to 1/32.
- Typical HarDNet68 configurations on ImageNet stack several convolutional layers per HarD block, with channel counts increasing from approximately 64 to 1024 across the stages.
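The downsampling schedule above maps directly to pixel sizes. The sketch below assumes a 352×352 input, a common training resolution in the polyp-segmentation literature (the assumption is illustrative, not taken from the paper's text here):

```python
# Spatial size of the encoder feature maps at each downsampling factor,
# assuming a 352x352 input image.

def stage_resolutions(size: int = 352) -> dict[str, int]:
    """Map the 1/4 ... 1/32 downsampling schedule to feature-map side lengths."""
    return {f"1/{s}": size // s for s in (4, 8, 16, 32)}

print(stage_resolutions(352))  # -> {'1/4': 88, '1/8': 44, '1/16': 22, '1/32': 11}
```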
2. Decoder Architecture: Cascaded Partial Decoder with RFB Modules
The decoder leverages a cascaded partial design, inspired by CPD [Wu et al., CVPR 2019], processing only the deepest three encoder feature maps (from stages 2–4). Shallow, high-resolution features are omitted to reduce computation.
The deep encoder outputs, denoted here as $f_2$, $f_3$, $f_4$ for stages 2–4, are upsampled and fused stage-wise:
- The deepest feature map ($f_4$, at 1/32 resolution) is upsampled and element-wise multiplied with the RFB-processed $f_3$ (1/16 resolution) to yield $g_3$.
- $g_3$ is similarly upsampled and fused with the RFB-processed $f_2$ (1/8 resolution), yielding $g_2$.
- A final upsampling and convolution produce the full-resolution segmentation mask.
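At the shape level, the cascaded fusion can be illustrated with NumPy stand-ins for the 1/32, 1/16, and 1/8 feature maps. This is a minimal sketch using nearest-neighbour upsampling and an element-wise product only; the RFB convolutions and channel handling of the actual network are omitted:

```python
import numpy as np

def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x spatial upsampling of an (H, W) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

H = 352                             # assumed input side length
f4 = np.ones((H // 32, H // 32))    # deepest map, 1/32 resolution
f3 = np.ones((H // 16, H // 16))    # 1/16 resolution (RFB output stand-in)
f2 = np.ones((H // 8, H // 8))      # 1/8 resolution (RFB output stand-in)

g3 = upsample2x(f4) * f3            # fuse at 1/16 resolution
g2 = upsample2x(g3) * f2            # fuse at 1/8 resolution

print(g2.shape)  # -> (44, 44)
```

Each fusion step only works because the upsampled map exactly matches the next shallower map's resolution, which is why the decoder consumes consecutive encoder stages.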
Receptive Field Blocks (RFBs) enlarge the effective receptive field by applying parallel 3×3 convolutions with multiple dilation rates (1, 3, 5) alongside a shortcut branch, concatenating the outputs, and compressing them with a 1×1 convolution.
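The dilation rates translate into effective kernel sizes via standard convolution arithmetic, $k_\text{eff} = k + (k-1)(d-1)$, which makes the RFB's multi-scale coverage concrete:

```python
# Effective kernel size of a dilated convolution: k_eff = k + (k - 1) * (d - 1).
# For the RFB's 3x3 branches with dilation rates 1, 3, and 5, this gives
# receptive windows of 3, 7, and 11 pixels per branch.

def effective_kernel(k: int, d: int) -> int:
    """Side length of the window covered by a k x k conv with dilation d."""
    return k + (k - 1) * (d - 1)

print([effective_kernel(3, d) for d in (1, 3, 5)])  # -> [3, 7, 11]
```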
Skip connections in this design exclude the shallowest features and introduce only those from mid-to-deep encoder stages via the RFB and element-wise product (Huang et al., 2021).
3. Training, Losses, and Evaluation Metrics
HarDNet-MSEG evaluation uses standard segmentation metrics:
- Mean Dice (mDice): $\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$, averaged over test images
- Mean Intersection over Union (mIoU): $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$
- Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$
- Recall: $\mathrm{Recall} = \frac{TP}{TP + FN}$
- F score: $F_\beta = \frac{(1+\beta^2)\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\cdot\mathrm{Precision} + \mathrm{Recall}}$ (the recall-weighted $\beta = 2$ is common in this benchmark)
- Accuracy: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
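These pixel-level metrics all derive from a confusion matrix over binary masks. The sketch below is a minimal illustration over flattened binary sequences, not tied to any particular evaluation toolkit:

```python
# Minimal Dice / IoU computation from TP, FP, FN counts over binary masks.

def confusion(pred, target):
    """Count TP, FP, FN, TN over two equal-length binary sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    tn = sum(1 for p, t in zip(pred, target) if not p and not t)
    return tp, fp, fn, tn

def dice(pred, target):
    tp, fp, fn, _ = confusion(pred, target)
    return 2 * tp / (2 * tp + fp + fn)

def iou(pred, target):
    tp, fp, fn, _ = confusion(pred, target)
    return tp / (tp + fp + fn)

pred   = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(round(dice(pred, target), 3))  # -> 0.667
print(round(iou(pred, target), 3))   # -> 0.5
```

Note that Dice is always at least as large as IoU for the same masks, which is why the mDice and mIoU columns in the benchmark table track each other closely.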
The paper does not specify an explicit loss function; weighted sums of pixel-wise cross-entropy and Dice loss are typical in the literature, but this is not confirmed for this network.
Two main training regimes are described:
- Kvasir-SEG "Jha split": 512×512 input, SGD optimizer, 100 epochs, random rotation/flip augmentations.
- PraNet split: 352×352 input (following PraNet's protocol), Adam optimizer, 100 epochs, no augmentations.
Batch size and exact learning rates are not specified here (Huang et al., 2021).
4. Performance Benchmarks and Comparative Results
HarDNet-MSEG delivers state-of-the-art results across five polyp segmentation datasets, maintaining high throughput:
| Dataset | mDice | mIoU | Inference FPS |
|---|---|---|---|
| Kvasir-SEG (512×512) | 0.904 | 0.848 | 86.7 |
| Kvasir-SEG (352×352) | 0.912 | 0.857 | 88 |
| CVC-ClinicDB | 0.932 | 0.882 | 88 |
| CVC-ColonDB | 0.731 | 0.660 | 88 |
| ETIS-Larib Polyp DB | 0.677 | 0.613 | 88 |
| EndoScene (CVC-T) | 0.887 | 0.821 | 88 |
Compared with a U-Net (ResNet34 backbone) on Kvasir-SEG, HarDNet-MSEG is both more accurate (mDice 0.904 vs. 0.876) and faster (86.7 FPS vs. 35 FPS). Across benchmarks, it outperforms prior networks such as PraNet in both accuracy and speed (Huang et al., 2021).
5. Adaptations and Extensions: DFUC 2022 and Lawin Transformer Decoder
In the Diabetic Foot Ulcer Grand Challenge 2022, a top-performing submission employed a modified HarDNet-MSEG as its backbone. Key modifications included:
- Decoder Replacement: The original CPD+RFB decoder was substituted with a Lawin Transformer block. This module aggregates multi-scale context using large-window cross-attention.
- Skip-connection Rerouting: Instead of fixed skip connections at all encoder stages, a subset of mid-level encoder features was selected for decoder fusion, aiming to better capture ulcer scale.
- Input/Output Channel Balancing: Channel widths were rebalanced at encoder input/output to yield more symmetric feature maps.
The result was a network that achieved a Dice score of 0.7287 and Jaccard of 0.6252 on the DFUC 2022 test set. Morphological hole-filling and removal of small connected components were applied as post-processing. Low-level training details, loss function, and augmentations were not disclosed in the summary (Kendrick et al., 2023).
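The two post-processing steps (small-component removal and hole filling) can be sketched as follows. This is a pure-Python illustration assuming 4-connectivity and a toy size threshold; a real pipeline would typically use `scipy.ndimage` (`label`, `binary_fill_holes`) instead:

```python
from collections import deque

def _components(mask, value):
    """Yield lists of (r, c) coordinates for 4-connected components of `value`."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == value and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and mask[ny][nx] == value and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                yield comp

def postprocess(mask, min_size=2):
    """Drop foreground components smaller than min_size, then fill enclosed holes."""
    out = [row[:] for row in mask]
    for comp in _components(out, 1):          # small-component removal
        if len(comp) < min_size:
            for y, x in comp:
                out[y][x] = 0
    rows, cols = len(out), len(out[0])
    for comp in _components(out, 0):          # fill background regions
        if not any(y in (0, rows - 1) or x in (0, cols - 1) for y, x in comp):
            for y, x in comp:                 # ...that never touch the border
                out[y][x] = 1
    return out

mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],   # the enclosed 0 is a hole: filled
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],   # isolated single pixel: removed
]
```

Calling `postprocess(mask)` fills the central hole and deletes the stray corner pixel, mirroring the challenge submission's clean-up of ragged predictions.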
6. Impact, Adoption, and Limitations
HarDNet-MSEG established a performance/cost trade-off benchmark in polyp segmentation, providing both high mDice and low-latency inference. Its encoder design offers significant memory traffic reduction compared to DenseNet, and its partial decoder further reduces computational overhead.
The architecture’s adaptability was demonstrated in DFU segmentation, where a Lawin Transformer decoder was successfully substituted while preserving HarDNet’s encoder advantages.
Some limitations remain: the original publication specifies neither the loss function nor the batch size, and detailed architectural parameters (channel widths, growth rates, etc.) have not been disclosed for some adapted variants, including challenge-winning ones. A plausible implication is that proprietary enhancements to the decoder or training strategy can further extend the basic framework's reach, but transparent ablation studies are needed to quantify each modification's effect (Huang et al., 2021; Kendrick et al., 2023).