
Bi-Feature Mask Former (BFMF)

Updated 29 January 2026
  • The paper introduces a modular Bi-Feature Mask Former (BFMF) that fuses multi-scale features to reduce the semantic gap in skip connections.
  • It employs parallel multi-kernel convolutions and attention-guided fusion to blend high- and low-resolution feature maps for improved boundary precision.
  • Empirical evaluations on the BUSI dataset show significant gains in IoU and F1-score, highlighting BFMF's effectiveness in fine-grained medical image segmentation.

The Bi-Feature Mask Former (BFMF) is a modular architectural block introduced in the Phi-SegNet framework for medical image segmentation. BFMF is designed to reduce the semantic gap inherent in skip connections between neighboring encoder stages by multi-scale mask-based fusion of features, thereby enhancing structural fidelity during decoding. It operates at the interface of high-resolution and low-resolution encoder features, systematically blending and attending to relevant details before propagating refined representations to the decoding stages (Ali et al., 22 Jan 2026).

1. Placement and Role within Phi-SegNet

In the overarching topology of Phi-SegNet, BFMF modules are embedded at each skip connection between successive encoder levels $E_i$ and $E_{i+1}$ for $i = 0, \ldots, n-2$. Given the typical hierarchical structure of a CNN-based encoder, the module explicitly fuses "high-res" features $E_i$ ($x \in \mathbb{R}^{H \times W \times C}$) with "low-res" features $E_{i+1}$ ($x_s \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times C_s}$) via a multi-kernel convolutional (MkC) pipeline. The architecture yields two mask-feature maps ($y$ and $y_s$), which are input to an attention-guided fusion block that outputs a refined feature representation $E_{A,i}$. This refined feature is subsequently utilized in the decoder, promoting effective information reuse and improved localization.

2. Mathematical Operations and Module Formulation

The forward pass through a BFMF module consists of the following distinct stages:

  1. Multi-Kernel Convolution on High-Resolution Input:

$$p_{11},\, p_{21},\, p_{31} = \mathrm{MkC}(x)$$

Each path uses convolutions with kernel sizes $k \in \{1, 3, 5\}$, dilation rates $d \in \{1, 2\}$, and yields $C$ channels.

  2. Feature Concatenation and Upsampling:

$$x_{c1} = p_{11} \oplus p_{21} \oplus \mathrm{Up}(x_s)$$

Here, $\oplus$ denotes channel-wise concatenation, and $\mathrm{Up}(\cdot)$ upsamples $x_s$ to shape $[H, W, C_s]$.

  3. Second MkC Pass:

$$p_{12},\, p_{22},\, p_{32} = \mathrm{MkC}(x_{c1})$$

  4. Mask Generation:

$$y = \sigma\Big((p_{12} \oplus p_{22} \oplus p_{31}) * W_3^1 * W_1^1\Big), \quad y_s = \max\Big(\sigma(p_{32}) * W_1^1\Big)$$

Here, the convolutions $W_3^1$ and $W_1^1$ reduce dimensionality and maintain channel alignment; $\sigma$ is the sigmoid, and global max-pooling produces $y_s$.

  5. Attention-Guided Feature Fusion:

$$E_{m,i} = \sigma\left(\mathrm{BN}\left(\mathrm{BN}\left((y_s \oplus y) * W_3^1\right) * W_3^1\right)\right), \quad E_{A,i} = E_i \otimes E_{m,i} + E_i$$

Batch normalization and the final sigmoid sharpen $E_{m,i}$; element-wise multiplication with $E_i$ applies the learned attention mask, while the additive $E_i$ term preserves a residual path.
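The mask-generation and fusion steps above can be sketched at toy scale in NumPy. This is an illustrative stand-in, not the paper's implementation: $1\times1$ channel-mixing matrices replace the $W_3^1 * W_1^1$ convolution stacks, batch normalization is omitted, and all array names besides the paper's symbols are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel channel mix: (H, W, Cin) @ (Cin, Cout)
    return x @ w

rng = np.random.default_rng(0)
H, W, C = 8, 8, 4

# Hypothetical stand-ins for the second-pass MkC outputs and the encoder feature E_i
p12, p22, p31, p32 = (rng.standard_normal((H, W, C)) for _ in range(4))
E_i = rng.standard_normal((H, W, C))

# Mask generation: a single 1x1 mix stands in for the W_3^1 * W_1^1 reduction
w_y = rng.standard_normal((3 * C, C)) * 0.1
y = sigmoid(conv1x1(np.concatenate([p12, p22, p31], axis=-1), w_y))  # (H, W, C), in (0, 1)

w_s = rng.standard_normal((C, C)) * 0.1
y_s = sigmoid(conv1x1(p32, w_s)).max(axis=(0, 1))  # global max pool -> (C,)

# Attention-guided fusion: build E_m from the fused masks, then mask E_i residually
w_m = rng.standard_normal((2 * C, C)) * 0.1
fused = np.concatenate([np.broadcast_to(y_s, (H, W, C)), y], axis=-1)
E_m = sigmoid(conv1x1(fused, w_m))  # attention mask, entries in (0, 1)
E_A = E_i * E_m + E_i               # E_{A,i} = E_i (x) E_{m,i} + E_i

print(E_A.shape)  # (8, 8, 4)
```

Because the mask entries lie in $(0, 1)$, the residual form scales each feature by a factor in $(1, 2)$ rather than suppressing it outright, which is the usual rationale for the additive skip.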

3. Mask Fusion Mechanics and Kernel Strategies

The BFMF employs parallel convolutional paths with:

  • Kernel sizes: $1\times1$, $3\times3$, $5\times5$.
  • Dilation rates: $1$ and $2$. For the $5\times5$ kernel with $d=2$, the effective receptive field extends to $9\times9$, permitting broader context aggregation.
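The dilated receptive-field arithmetic above follows the standard formula $k + (k-1)(d-1)$ for a $k\times k$ kernel with dilation $d$; a small helper (name illustrative) makes this checkable:

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective receptive field of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

for k in (1, 3, 5):
    for d in (1, 2):
        e = effective_kernel(k, d)
        print(f"k={k}, d={d} -> {e}x{e}")
# The 5x5 kernel with d=2 covers a 9x9 window, as stated above.
```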

Following concatenations, dimensionality reduction is performed through sequential $3\times3$ and $1\times1$ convolutions. Sigmoid activations ensure mask outputs are bounded within $[0, 1]$. Batch normalization is used exclusively inside the attention block, applied twice in succession before the final sigmoid.

All convolutional weights ($W_k^d$) and normalization parameters are jointly optimized end-to-end with the network on the segmentation objective.

4. Implementation Details and Hyperparameters

The integration of BFMF assumes an EfficientNet-B4 encoder backbone, leading to $n$ encoder stages with channel sizes $C \in \{24, 48, 80, 160, \ldots\}$. The input tensors for each BFMF instance are:

  • $x \in \mathbb{R}^{H \times W \times C}$
  • $x_s \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times C_s}$

After the first MkC, three intermediate outputs of shape $[H, W, C]$ are obtained; per the formulation above, $p_{11}$ and $p_{21}$ are concatenated with the upsampled low-res branch, while $p_{31}$ is carried forward to mask generation. Mask generation produces outputs $y$ and $y_s$ appropriate for subsequent attention and fusion steps. No extraneous parameters are introduced beyond those inherent to the convolutional modules. The dilation set is fixed at $\{1, 2\}$.
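Following the concatenation $x_{c1} = p_{11} \oplus p_{21} \oplus \mathrm{Up}(x_s)$, the channel bookkeeping can be traced with a small (hypothetical) helper; the example channel counts $C=24$, $C_s=48$ correspond to the first pair of EfficientNet-B4 stage widths listed above:

```python
def bfmf_shapes(H: int, W: int, C: int, Cs: int) -> dict:
    """Trace tensor shapes through one BFMF instance (illustrative helper;
    channel counts follow the equations in the text)."""
    return {
        "x": (H, W, C),
        "x_s": (H // 2, W // 2, Cs),
        "p11/p21/p31": (H, W, C),    # first MkC: each path keeps C channels
        "Up(x_s)": (H, W, Cs),       # upsampled low-res branch
        "x_c1": (H, W, 2 * C + Cs),  # p11 (+) p21 (+) Up(x_s)
    }

# Example with C=24 (stage E_i) and Cs=48 (stage E_{i+1})
print(bfmf_shapes(128, 128, 24, 48)["x_c1"])  # (128, 128, 96)
```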

5. Decoder Integration and Loss Synergy

The features refined by BFMF, $E_{A,i}$, are concatenated within the decoder pathway after upsampling and merged via a DoubleConv + Up block to yield the current decoder output. Each decoder stage generates an auxiliary "phase boundary" mask $x_{\varphi,i}$ via a $\varphi$-Conditioner (a $1\times1$ convolution followed by a sigmoid). These masks are subsequently processed by the Reverse Fourier Attention module and incorporated into the phase-aware loss,

$$\mathcal{L}_{\varphi} = \sum_{i=0}^{n-1} \left\|\mathbf{u}[\angle x_{\varphi,i}; k] - \mathbf{u}[\angle I_y; k]\right\|_2 + \left\|\mathbf{u}[\angle x_{\varphi,i}; l] - \mathbf{u}[\angle I_y; l]\right\|_2,$$

which enforces boundary regularization using phase priors. Although $\mathcal{L}_\varphi$ is not applied directly to BFMF outputs, it indirectly sharpens the quality of the features propagated through these blocks by reinforcing structural boundaries downstream.
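The extraction $\mathbf{u}[\cdot; k]$ is not fully specified in this summary. As a generic illustration only (the function name, the frequency-selection scheme, and all other details are assumptions, not the paper's method), comparing the 2-D Fourier phases of two masks at selected frequency indices might look like:

```python
import numpy as np

def phase_l2(pred: np.ndarray, target: np.ndarray, freqs) -> float:
    """Illustrative phase comparison: L2 distance between the 2-D Fourier
    phases of two masks at a chosen set of frequency indices."""
    ph_p = np.angle(np.fft.fft2(pred))
    ph_t = np.angle(np.fft.fft2(target))
    diffs = [ph_p[f] - ph_t[f] for f in freqs]
    return float(np.linalg.norm(diffs))

rng = np.random.default_rng(1)
mask = (rng.random((32, 32)) > 0.5).astype(float)
print(phase_l2(mask, mask, freqs=[(0, 1), (1, 0)]))  # 0.0 for identical masks
```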

6. Empirical Effects and Ablation Findings

Ablation studies on the BUSI dataset detail the impact of BFMF. When added to an EfficientNet-B4 baseline (using the standard segmentation loss $\mathcal{L}_s$), BFMF increases IoU from 0.7792 to 0.8170 and F1-score from 0.8574 to 0.9037, improvements of 3.78 and 4.63 percentage points, respectively. Supplementing with the phase-aware loss $\mathcal{L}_\varphi$ further boosts IoU and F1 to 0.8293 and 0.9106. Visualization of single-channel BFMF masks shows elevated lesion contrast and boundary sharpness relative to unrefined encoder representations, supporting the claim that BFMF increases the discriminative power of skip connections for fine-grained object localization (Ali et al., 22 Jan 2026).
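The percentage-point deltas follow directly from the reported scores (values copied from the text above):

```python
# Ablation scores on BUSI as reported in the text
baseline = {"IoU": 0.7792, "F1": 0.8574}
with_bfmf = {"IoU": 0.8170, "F1": 0.9037}

# Gain in percentage points, rounded to two decimals
gains = {m: round(100 * (with_bfmf[m] - baseline[m]), 2) for m in baseline}
print(gains)  # {'IoU': 3.78, 'F1': 4.63}
```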

7. Context, Motivation, and Broader Implications

BFMF arises from the necessity to bridge semantic discontinuities across adjacent encoder resolutions, a challenge underscored in volumetric medical image analysis where preserving fine structures is critical. The architectural novelty lies in employing a mask-based, multi-scale fusion that systematically reweights spatial and semantic cues prior to decoding. The resultant gains in boundary precision and generalization capability across imaging modalities suggest broader potential for mask-based skip refinement beyond medical domains. A plausible implication is that similar multi-scale mask fusion strategies could augment conventional U-Net-type architectures by ameliorating information loss at resolution transition points, especially where fine object structures are key to performance (Ali et al., 22 Jan 2026).
