Bi-Feature Mask Former (BFMF)
- The paper introduces a modular Bi-Feature Mask Former (BFMF) that fuses multi-scale features to reduce the semantic gap in skip connections.
- It employs parallel multi-kernel convolutions and attention-guided fusion to blend high- and low-resolution feature maps for improved boundary precision.
- Empirical evaluations on the BUSI dataset show significant gains in IoU and F1-score, highlighting BFMF's effectiveness in fine-grained medical image segmentation.
The Bi-Feature Mask Former (BFMF) is a modular architectural block introduced in the Phi-SegNet framework for medical image segmentation. BFMF is designed to reduce the semantic gap inherent in skip connections between neighboring encoder stages through multi-scale, mask-based feature fusion, thereby enhancing structural fidelity during decoding. It operates at the interface of high-resolution and low-resolution encoder features, systematically blending and attending to relevant details before propagating refined representations to the decoding stages (Ali et al., 22 Jan 2026).
1. Placement and Role within Phi-SegNet
In the overarching topology of Phi-SegNet, a BFMF module is embedded at each skip connection between successive encoder levels. Given the typical hierarchical structure of a CNN-based encoder, the module explicitly fuses high-resolution features with their low-resolution neighbors via a multi-kernel convolutional (MkC) pipeline. The architecture yields two mask-feature maps, which are input to an attention-guided fusion block that outputs a refined feature representation. This refined feature is subsequently utilized in the decoder, promoting effective information reuse and improved localization.
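The placement described above can be sketched as a generic encoder–decoder forward pass in which each refined BFMF output replaces the plain skip connection. All names here are hypothetical callables standing in for the actual blocks; none come from the paper:

```python
def unet_like_forward(encoder_feats, bfmf_blocks, decoder_stages):
    """Sketch of BFMF placement in a U-Net-style topology.

    encoder_feats  : feature maps ordered high-res -> low-res (last = bottleneck)
    bfmf_blocks    : one BFMF per adjacent encoder pair; fuses (high, low)
    decoder_stages : each stage upsamples and merges with a refined skip
    """
    # Each BFMF fuses a high-res map with its low-res neighbour; the refined
    # result replaces the plain skip connection at that level.
    skips = [bfmf(encoder_feats[i], encoder_feats[i + 1])
             for i, bfmf in enumerate(bfmf_blocks)]

    x = encoder_feats[-1]  # bottleneck features
    # Decode from coarsest to finest, consuming refined skips in reverse order.
    for stage, skip in zip(decoder_stages, reversed(skips)):
        x = stage(x, skip)
    return x
```

With toy scalar "features" and addition standing in for fusion and decoding, `unet_like_forward([1, 2, 3], [lambda a, b: a + b] * 2, [lambda x, s: x + s] * 2)` traces the data flow end to end.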
2. Mathematical Operations and Module Formulation
The forward pass through a BFMF module consists of the following distinct stages:
- Multi-Kernel Convolution on High-Resolution Input:
Each parallel path applies a convolution with a distinct kernel size and dilation rate and emits a fixed number of output channels.
- Feature Concatenation and Upsampling:
Channel-wise concatenation joins the multi-kernel outputs with the low-resolution branch, which is first upsampled to match the high-resolution spatial size.
- Second MkC Pass: the concatenated tensor is passed through a second multi-kernel convolution stage, refining the merged features before mask generation.
- Mask Generation:
Convolutions reduce dimensionality and maintain channel alignment; a sigmoid bounds the mask values in [0, 1], and global max-pooling produces a compact pooled summary of the mask.
- Attention-Guided Feature Fusion:
Batch normalization and a final sigmoid sharpen the attention map, and element-wise multiplication applies the learned attention mask to the incoming features.
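The stages above can be traced with a minimal NumPy sketch. For brevity, the multi-kernel convolutions are reduced to per-pixel channel mixing (a 1×1-conv stand-in), and the channel and spatial sizes are hypothetical, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    # Nearest-neighbour upsampling of a (C, H, W) tensor to (C, 2H, 2W).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def channel_mix(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # (C_out, C_in) applied to (C_in, H, W) gives (C_out, H, W).
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
C_hi, C_lo, H, W = 8, 16, 32, 32  # hypothetical channel/spatial sizes

f_hi = rng.standard_normal((C_hi, H, W))            # high-res encoder features
f_lo = rng.standard_normal((C_lo, H // 2, W // 2))  # low-res encoder features

# 1) Stand-in for the first multi-kernel conv: mix high-res channels.
h = channel_mix(f_hi, rng.standard_normal((C_hi, C_hi)) * 0.1)

# 2) Upsample the low-res branch and concatenate along channels.
merged = np.concatenate([h, upsample2x(f_lo)], axis=0)  # (C_hi + C_lo, H, W)

# 3) Mask generation: reduce to one channel, squash to [0, 1] with a sigmoid.
mask = sigmoid(channel_mix(merged, rng.standard_normal((1, C_hi + C_lo)) * 0.1))

# 4) Attention-guided fusion: apply the mask to the high-res features
#    (the (1, H, W) mask broadcasts over all C_hi channels).
refined = h * mask

print(refined.shape, merged.shape)
```

The sketch only illustrates the data flow (upsample, concatenate, mask, multiply); the real module's second MkC pass, batch normalization, and pooling steps are omitted.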
3. Mask Fusion Mechanics and Kernel Strategies
The BFMF employs parallel convolutional paths with:
- Kernel sizes: a distinct size per parallel path.
- Dilation rates: $1$ and $2$. For a kernel of size $k$ with dilation $2$, the effective receptive field extends to $2k - 1$, permitting broader context aggregation.
Following concatenation, dimensionality reduction is performed through sequential convolutions. Sigmoid activations ensure mask outputs are bounded within [0, 1]. Batch normalization is used only inside the attention block, applied twice in succession before the final sigmoid.
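The receptive-field arithmetic for dilated convolutions can be checked directly: the effective kernel size is $k + (k-1)(d-1)$, which reduces to $2k - 1$ at dilation $2$. In the sketch below, only the dilation rates $1$ and $2$ come from the paper; the kernel sizes are illustrative placeholders:

```python
def effective_kernel(k: int, d: int) -> int:
    # Effective kernel size of a dilated convolution: k + (k - 1) * (d - 1).
    return k + (k - 1) * (d - 1)

# Dilation rates 1 and 2 are specified by the paper; the kernel sizes
# (3, 5, 7) below are hypothetical examples, not values from the paper.
for k in (3, 5, 7):
    for d in (1, 2):
        print(f"k={k}, d={d} -> effective kernel {effective_kernel(k, d)}")
```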
All convolutional weights and normalization parameters are jointly optimized end-to-end with the network on the segmentation objective.
4. Implementation Details and Hyperparameters
The integration of BFMF assumes an EfficientNet-B4 encoder backbone, whose stages produce feature maps at progressively lower resolutions and higher channel counts. Each BFMF instance takes the feature tensors of two adjacent encoder stages as input.
After the first MkC, three intermediate outputs are obtained and concatenated with the upsampled low-res branch. Mask generation produces the two mask-feature maps used in the subsequent attention and fusion steps. No extraneous parameters are introduced beyond those inherent to the convolutional modules. The dilation rates are fixed at $1$ and $2$.
5. Decoder Integration and Loss Synergy
The features refined by BFMF are concatenated within the decoder pathway after upsampling and merged via a DoubleConv + Up block to yield the current decoder output. Each decoder stage generates an auxiliary “phase boundary” mask via a Φ-Conditioner (a 1×1 convolution followed by a sigmoid). These masks are subsequently processed by the Reverse Fourier Attention module and incorporated into the phase-aware loss, which enforces boundary regularization using phase priors. Although this loss is not applied directly to BFMF outputs, it indirectly sharpens the quality of the features propagated through these blocks by reinforcing structural boundaries downstream.
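The exact form of the phase-aware loss is not reproduced here. A minimal sketch of one plausible composite objective follows, with a crude finite-difference boundary map standing in for the Fourier-phase prior and a hypothetical weight `lam`; none of these choices are taken from the paper:

```python
import numpy as np

def bce(p, t, eps=1e-7):
    # Pixel-wise binary cross-entropy, averaged over the image.
    p = np.clip(p, eps, 1 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())

def boundary_map(mask):
    # Crude boundary extraction: a pixel is a boundary pixel if it differs
    # from a vertical or horizontal neighbour. This is only a stand-in for
    # the paper's phase prior, whose exact form is not given here.
    dy = np.abs(np.diff(mask, axis=0, prepend=mask[:1]))
    dx = np.abs(np.diff(mask, axis=1, prepend=mask[:, :1]))
    return ((dx + dy) > 0).astype(float)

def total_loss(pred_mask, pred_boundary, gt_mask, lam=0.5):
    # Composite objective: segmentation term plus a weighted boundary term,
    # L_total = L_seg + lam * L_phase (lam is a hypothetical weight).
    l_seg = bce(pred_mask, gt_mask)
    l_phase = bce(pred_boundary, boundary_map(gt_mask))
    return l_seg + lam * l_phase

# Toy usage: a perfect prediction scores (near) zero; a perturbed one scores higher.
gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0
loss_perfect = total_loss(gt, boundary_map(gt), gt)
loss_noisy = total_loss(np.clip(gt + 0.3, 0, 1), boundary_map(gt), gt)
print(loss_perfect, loss_noisy)
```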
6. Empirical Effects and Ablation Findings
Ablation studies on the BUSI dataset detail the impact of BFMF. When added to an EfficientNet-B4 baseline trained with the standard segmentation loss, BFMF increases IoU from 0.7792 to 0.8170 and F1-score from 0.8574 to 0.9037, improvements of 3.78 and 4.63 percentage points, respectively. Supplementing BFMF with the phase-aware loss further boosts IoU and F1 to 0.8293 and 0.9106. Visualization of single-channel BFMF masks shows elevated lesion contrast and boundary sharpness relative to unrefined encoder representations, substantiating the contention that BFMF increases the discriminative efficacy of skip connections for fine-grained object localization (Ali et al., 22 Jan 2026).
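The percentage-point gains quoted above follow directly from the reported scores:

```python
# Reported BUSI results: EfficientNet-B4 baseline vs. baseline + BFMF.
baseline  = {"IoU": 0.7792, "F1": 0.8574}
with_bfmf = {"IoU": 0.8170, "F1": 0.9037}

# Difference of two fractional scores, expressed in percentage points.
gains_pp = {k: round((with_bfmf[k] - baseline[k]) * 100, 2) for k in baseline}
print(gains_pp)  # {'IoU': 3.78, 'F1': 4.63}
```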
7. Context, Motivation, and Broader Implications
BFMF arises from the necessity to bridge semantic discontinuities across adjacent encoder resolutions, a challenge underscored in volumetric medical image analysis where preserving fine structures is critical. The architectural novelty lies in employing a mask-based, multi-scale fusion that systematically reweights spatial and semantic cues prior to decoding. The resultant gains in boundary precision and generalization capability across imaging modalities suggest broader potential for mask-based skip refinement beyond medical domains. A plausible implication is that similar multi-scale mask fusion strategies could augment conventional U-Net-type architectures by ameliorating information loss at resolution transition points, especially where fine object structures are key to performance (Ali et al., 22 Jan 2026).