Bi-Feature Mask Former (BFMF)
- The paper introduces a modular Bi-Feature Mask Former (BFMF) that fuses multi-scale features to reduce the semantic gap in skip connections.
- It employs parallel multi-kernel convolutions and attention-guided fusion to blend high- and low-resolution feature maps for improved boundary precision.
- Empirical evaluations on the BUSI dataset show significant gains in IoU and F1-score, highlighting BFMF's effectiveness in fine-grained medical image segmentation.
The Bi-Feature Mask Former (BFMF) is a modular architectural block introduced in the Phi-SegNet framework for medical image segmentation. BFMF is designed to reduce the semantic gap inherent in skip connections between neighboring encoder stages through multi-scale, mask-based feature fusion, thereby enhancing structural fidelity during decoding. It operates at the interface of high-resolution and low-resolution encoder features, systematically blending and attending to relevant details before propagating refined representations to the decoding stages (Ali et al., 22 Jan 2026).
1. Placement and Role within Phi-SegNet
In the overarching topology of Phi-SegNet, a BFMF module is embedded at each skip connection between successive encoder levels. Given the typical hierarchical structure of a CNN-based encoder, the module explicitly fuses high-resolution features with their low-resolution neighbors via a multi-kernel convolutional (MkC) pipeline. The architecture yields two mask-feature maps, which are input to an attention-guided fusion block that outputs a refined feature representation. This refined feature is subsequently utilized in the decoder, promoting effective information reuse and improved localization.
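The placement described above can be sketched as a generic encoder–decoder forward pass in which each refined BFMF output replaces the plain skip connection. All names here are hypothetical callables standing in for the actual blocks; none come from the paper:

```python
def unet_like_forward(encoder_feats, bfmf_blocks, decoder_stages):
    """Sketch of BFMF placement in a U-Net-style topology.

    encoder_feats  : feature maps ordered high-res -> low-res (last = bottleneck)
    bfmf_blocks    : one BFMF per adjacent encoder pair; fuses (high, low)
    decoder_stages : each stage upsamples and merges with a refined skip
    """
    # Each BFMF fuses a high-res map with its low-res neighbour; the refined
    # result replaces the plain skip connection at that level.
    skips = [bfmf(encoder_feats[i], encoder_feats[i + 1])
             for i, bfmf in enumerate(bfmf_blocks)]

    x = encoder_feats[-1]  # bottleneck features
    # Decode from coarsest to finest, consuming refined skips in reverse order.
    for stage, skip in zip(decoder_stages, reversed(skips)):
        x = stage(x, skip)
    return x
```

With toy scalar "features" and addition standing in for fusion and decoding, `unet_like_forward([1, 2, 3], [lambda a, b: a + b] * 2, [lambda x, s: x + s] * 2)` traces the data flow end to end.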
2. Mathematical Operations and Module Formulation
The forward pass through a BFMF module consists of the following distinct stages:
- Multi-Kernel Convolution on High-Resolution Input:
Each parallel path applies a convolution with a distinct kernel size and dilation rate and emits a fixed number of output channels.
- Feature Concatenation and Upsampling:
Channel-wise concatenation joins the multi-kernel outputs with the low-resolution branch, which is first upsampled to match the high-resolution spatial size.
- Second MkC Pass: the concatenated tensor is passed through a second multi-kernel convolution stage, refining the merged features before mask generation.
- Mask Generation:
Convolutions reduce dimensionality and maintain channel alignment; a sigmoid bounds the mask values in [0, 1], and global max-pooling produces a compact pooled summary of the mask.
- Attention-Guided Feature Fusion:
Batch normalization and a final sigmoid sharpen the attention map, and element-wise multiplication applies the learned attention mask to the incoming features.
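The stages above can be traced with a minimal NumPy sketch. For brevity, the multi-kernel convolutions are reduced to per-pixel channel mixing (a 1×1-conv stand-in), and the channel and spatial sizes are hypothetical, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    # Nearest-neighbour upsampling of a (C, H, W) tensor to (C, 2H, 2W).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def channel_mix(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # (C_out, C_in) applied to (C_in, H, W) gives (C_out, H, W).
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
C_hi, C_lo, H, W = 8, 16, 32, 32  # hypothetical channel/spatial sizes

f_hi = rng.standard_normal((C_hi, H, W))            # high-res encoder features
f_lo = rng.standard_normal((C_lo, H // 2, W // 2))  # low-res encoder features

# 1) Stand-in for the first multi-kernel conv: mix high-res channels.
h = channel_mix(f_hi, rng.standard_normal((C_hi, C_hi)) * 0.1)

# 2) Upsample the low-res branch and concatenate along channels.
merged = np.concatenate([h, upsample2x(f_lo)], axis=0)  # (C_hi + C_lo, H, W)

# 3) Mask generation: reduce to one channel, squash to [0, 1] with a sigmoid.
mask = sigmoid(channel_mix(merged, rng.standard_normal((1, C_hi + C_lo)) * 0.1))

# 4) Attention-guided fusion: apply the mask to the high-res features
#    (the (1, H, W) mask broadcasts over all C_hi channels).
refined = h * mask

print(refined.shape, merged.shape)
```

The sketch only illustrates the data flow (upsample, concatenate, mask, multiply); the real module's second MkC pass, batch normalization, and pooling steps are omitted.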
3. Mask Fusion Mechanics and Kernel Strategies
The BFMF employs parallel convolutional paths with:
- Kernel sizes: a distinct size per parallel path.
- Dilation rates: $1$ and $2$. For a kernel of size $k$ with dilation $2$, the effective receptive field extends to $2k - 1$, permitting broader context aggregation.
Following concatenation, dimensionality reduction is performed through sequential convolutions. Sigmoid activations ensure mask outputs are bounded within [0, 1]. Batch normalization is used only inside the attention block, applied twice in succession before the final sigmoid.
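The receptive-field arithmetic for dilated convolutions can be checked directly: the effective kernel size is $k + (k-1)(d-1)$, which reduces to $2k - 1$ at dilation $2$. In the sketch below, only the dilation rates $1$ and $2$ come from the paper; the kernel sizes are illustrative placeholders:

```python
def effective_kernel(k: int, d: int) -> int:
    # Effective kernel size of a dilated convolution: k + (k - 1) * (d - 1).
    return k + (k - 1) * (d - 1)

# Dilation rates 1 and 2 are specified by the paper; the kernel sizes
# (3, 5, 7) below are hypothetical examples, not values from the paper.
for k in (3, 5, 7):
    for d in (1, 2):
        print(f"k={k}, d={d} -> effective kernel {effective_kernel(k, d)}")
```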
All convolutional weights and normalization parameters are jointly optimized end-to-end with the network on the segmentation objective.
4. Implementation Details and Hyperparameters
The integration of BFMF assumes an EfficientNet-B4 encoder backbone, whose stages produce feature maps at progressively lower resolutions and higher channel counts. Each BFMF instance takes the feature tensors of two adjacent encoder stages as input.
After the first MkC, three intermediate outputs are obtained and concatenated with the upsampled low-res branch. Mask generation produces the two mask-feature maps used in the subsequent attention and fusion steps. No extraneous parameters are introduced beyond those inherent to the convolutional modules. The dilation rates are fixed at $1$ and $2$.
5. Decoder Integration and Loss Synergy
The features refined by BFMF are concatenated within the decoder pathway after upsampling and merged via a DoubleConv + Up block to yield the current decoder output. Each decoder stage generates an auxiliary “phase boundary” mask via a Φ-Conditioner (a 1×1 convolution followed by a sigmoid). These masks are subsequently processed by the Reverse Fourier Attention module and incorporated into the phase-aware loss, which enforces boundary regularization using phase priors. Although this loss is not applied directly to BFMF outputs, it indirectly sharpens the quality of the features propagated through these blocks by reinforcing structural boundaries downstream.
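The exact form of the phase-aware loss is not reproduced here. A minimal sketch of one plausible composite objective follows, with a crude finite-difference boundary map standing in for the Fourier-phase prior and a hypothetical weight `lam`; none of these choices are taken from the paper:

```python
import numpy as np

def bce(p, t, eps=1e-7):
    # Pixel-wise binary cross-entropy, averaged over the image.
    p = np.clip(p, eps, 1 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())

def boundary_map(mask):
    # Crude boundary extraction: a pixel is a boundary pixel if it differs
    # from a vertical or horizontal neighbour. This is only a stand-in for
    # the paper's phase prior, whose exact form is not given here.
    dy = np.abs(np.diff(mask, axis=0, prepend=mask[:1]))
    dx = np.abs(np.diff(mask, axis=1, prepend=mask[:, :1]))
    return ((dx + dy) > 0).astype(float)

def total_loss(pred_mask, pred_boundary, gt_mask, lam=0.5):
    # Composite objective: segmentation term plus a weighted boundary term,
    # L_total = L_seg + lam * L_phase (lam is a hypothetical weight).
    l_seg = bce(pred_mask, gt_mask)
    l_phase = bce(pred_boundary, boundary_map(gt_mask))
    return l_seg + lam * l_phase

# Toy usage: a perfect prediction scores (near) zero; a perturbed one scores higher.
gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0
loss_perfect = total_loss(gt, boundary_map(gt), gt)
loss_noisy = total_loss(np.clip(gt + 0.3, 0, 1), boundary_map(gt), gt)
print(loss_perfect, loss_noisy)
```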
6. Empirical Effects and Ablation Findings
Ablation studies on the BUSI dataset detail the impact of BFMF. When added to an EfficientNet-B4 baseline trained with the standard segmentation loss, BFMF increases IoU from 0.7792 to 0.8170 and F1-score from 0.8574 to 0.9037, improvements of 3.78 and 4.63 percentage points, respectively. Supplementing BFMF with the phase-aware loss further boosts IoU and F1 to 0.8293 and 0.9106. Visualization of single-channel BFMF masks shows elevated lesion contrast and boundary sharpness relative to unrefined encoder representations, substantiating the contention that BFMF increases the discriminative efficacy of skip connections for fine-grained object localization (Ali et al., 22 Jan 2026).
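The percentage-point gains quoted above follow directly from the reported scores:

```python
# Reported BUSI results: EfficientNet-B4 baseline vs. baseline + BFMF.
baseline  = {"IoU": 0.7792, "F1": 0.8574}
with_bfmf = {"IoU": 0.8170, "F1": 0.9037}

# Difference of two fractional scores, expressed in percentage points.
gains_pp = {k: round((with_bfmf[k] - baseline[k]) * 100, 2) for k in baseline}
print(gains_pp)  # {'IoU': 3.78, 'F1': 4.63}
```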
7. Context, Motivation, and Broader Implications
BFMF arises from the necessity to bridge semantic discontinuities across adjacent encoder resolutions, a challenge underscored in volumetric medical image analysis where preserving fine structures is critical. The architectural novelty lies in employing a mask-based, multi-scale fusion that systematically reweights spatial and semantic cues prior to decoding. The resultant gains in boundary precision and generalization capability across imaging modalities suggest broader potential for mask-based skip refinement beyond medical domains. A plausible implication is that similar multi-scale mask fusion strategies could augment conventional U-Net-type architectures by ameliorating information loss at resolution transition points, especially where fine object structures are key to performance (Ali et al., 22 Jan 2026).