Phi-SegNet: Phase-Aware Medical Segmentation
- The paper presents a CNN that leverages Fourier phase priors in both architecture and loss to enhance boundary detection in medical images.
- The Bi-Feature Mask Former and Reverse Fourier Attention modules fuse multi-scale features and apply phase-aware refinement to capture fine spatial details.
- Quantitative evaluations show an average IoU improvement of +1.54% over competitors, confirming robust performance across various imaging modalities.
Phi-SegNet is a convolutional neural network (CNN) architecture designed for medical image segmentation, distinguished by its explicit integration of spectral (Fourier phase) priors into both architectural design and loss function. By leveraging phase-aware supervision, Phi-SegNet aims to improve boundary fidelity and generalization across diverse imaging modalities, addressing limitations of traditional spatial-only encoding schemes prevalent in CNN and Transformer-based segmentation networks (Ali et al., 22 Jan 2026).
1. Architectural Foundations and Motivation
Phi-SegNet adopts an encoder-decoder paradigm with EfficientNet-B4 (pre-trained on ImageNet) serving as its encoder, optimized for robust feature extraction. The decoder pathways are augmented by two critical modules: Bi-Feature Mask Former (BFMF) for multi-scale feature fusion, and Reverse Fourier Attention (RFA) for phase-regularized spectral refinement. Single-channel “boundary masks” are produced at every decoder stage by phase (φ)-conditioners, which serve both as inputs to RFA and direct targets for spectral supervision. The explicit motivation for phase integration lies in the Fourier phase spectrum’s capacity to encode spatial alignment and contour geometry, thus facilitating the learning of sharper, more consistent object boundaries, particularly advantageous for medical images characterized by low contrast and noise.
2. Bi-Feature Mask Former (BFMF) Module
The BFMF module fuses adjacent encoder features to bridge semantic gaps between high-resolution spatial details and lower-resolution semantic context. Specifically:
- Inputs: a high-resolution encoder feature map $F_h$ and the adjacent lower-resolution encoder feature $F_l$.
- Multi-Kernel Convolutions (MkC): three parallel convolutions with distinct kernel sizes and dilation rates are applied to each input, capturing boundary cues at multiple receptive fields.
- Feature Blending: the multi-kernel outputs are fused as $F = \mathrm{Cat}\big(F_h, \mathrm{Up}(F_l)\big)$, where $\mathrm{Cat}(\cdot)$ denotes channel-wise concatenation and $\mathrm{Up}(\cdot)$ is 2× upsampling.
- Mask Generation: the blended features are projected to a single-channel boundary mask that feeds the subsequent RFA block.
- Interpretation: this multi-scale fusion encodes fine detail and contextual semantics within the mask features, enhancing sensitivity to both boundaries and global structure.
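The blending step above can be illustrated with a minimal numpy sketch. The function names, channel counts, and the nearest-neighbour upsampling choice are illustrative assumptions, not the paper's exact implementation (which uses learned convolutions):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map,
    # standing in for the BFMF's learned Up(.) operator.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def bfmf_blend(f_high, f_low):
    # Channel-wise concatenation of the high-resolution feature with
    # the 2x-upsampled low-resolution feature (the blending step).
    return np.concatenate([f_high, upsample2x(f_low)], axis=0)

f_high = np.random.randn(32, 64, 64)   # high-resolution encoder feature
f_low = np.random.randn(64, 32, 32)    # adjacent lower-resolution feature
fused = bfmf_blend(f_high, f_low)
print(fused.shape)  # (96, 64, 64)
```

A mask head (e.g., a 1×1 convolution) would then project the 96-channel fused tensor to the single-channel boundary mask.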
3. Reverse Fourier Attention (RFA) Block
The RFA block is central to spectral supervision and boundary refinement:
- φ-Conditioner Output: each decoder stage $s$ produces a single-channel boundary mask $M_s$.
- Reverse Mask and Spectral Filtering: the complementary mask $1 - M_s$ undergoes a 2D Discrete Fourier Transform ($\mathcal{F}$), followed by a centered low-pass filter (LPF) with cutoff radius $r$: $\hat{M}_s = \mathrm{LPF}_r\big(\mathcal{F}(1 - M_s)\big)$.
- After the inverse transform and modulus, $A_s = \big|\mathcal{F}^{-1}(\hat{M}_s)\big|$ yields a spatial attention map.
- Attention Application: decoder features $D_s$ are modulated element-wise, $D_s' = D_s \odot A_s$.
- Significance: the LPF prioritizes dominant low-frequency structures, denoising edges and guiding the network toward global structural form rather than spurious high-frequency detail.
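The reverse-mask, low-pass, and modulation steps can be sketched in numpy as follows. The cutoff radius and array shapes are illustrative assumptions; the paper's RFA operates on learned feature tensors:

```python
import numpy as np

def centered_lowpass(h, w, cutoff):
    # Binary centered low-pass filter: 1 within `cutoff` of the
    # spectrum center, 0 outside.
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    return (dist <= cutoff).astype(float)

def rfa_attention(mask, features, cutoff=16):
    # Reverse the boundary mask, low-pass filter it in the Fourier
    # domain, then use the modulus of the inverse transform as an
    # element-wise attention map over the decoder features.
    rev = 1.0 - mask
    spec = np.fft.fftshift(np.fft.fft2(rev))            # centered spectrum
    filt = spec * centered_lowpass(*mask.shape, cutoff)  # spectral LPF
    attn = np.abs(np.fft.ifft2(np.fft.ifftshift(filt)))  # inverse + modulus
    return features * attn                               # broadcast over channels

mask = (np.random.rand(64, 64) > 0.5).astype(float)
feats = np.random.randn(8, 64, 64)
out = rfa_attention(mask, feats)
print(out.shape)  # (8, 64, 64)
```

Because the attention map is a modulus, it is non-negative, so the modulation rescales feature magnitudes without flipping their signs.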
4. Phase-Aware Supervision and Loss Function
Phi-SegNet employs a dual-component loss to balance spectral coherence with spatial overlap:
- Phase Extraction: for each decoder stage $s$, the unwrapped Fourier phase of the predicted mask is computed, $\phi_s = \mathrm{unwrap}\big(\angle\,\mathcal{F}(M_s)\big)$.
- Phase Alignment Loss: $\mathcal{L}_{\text{phase}} = \sum_s \big\| \phi_s - \phi_G \big\|$, where $\phi_G = \mathrm{unwrap}\big(\angle\,\mathcal{F}(G)\big)$ and $G$ denotes the ground-truth mask.
- Spatial (IoU) Loss: $\mathcal{L}_{\text{IoU}} = 1 - \dfrac{|P \cap G|}{|P \cup G|}$, computed on the predicted segmentation $P$.
- Total Loss: $\mathcal{L} = \lambda_{\text{phase}} \mathcal{L}_{\text{phase}} + \lambda_{\text{IoU}} \mathcal{L}_{\text{IoU}}$,
with weighting coefficients $\lambda_{\text{phase}}$ and $\lambda_{\text{IoU}}$ balancing the two terms; φ-conditioner outputs create a closed feedback loop with the RFA blocks.
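A minimal numpy sketch of this dual-component loss follows. It assumes an L1 distance on raw (not unwrapped) phase spectra and unit weights, so it is a hedged stand-in for the paper's formulation rather than a reproduction of it:

```python
import numpy as np

def phase_loss(pred_mask, gt_mask):
    # Mean absolute difference between Fourier phase spectra
    # (assumed L1 form; the paper additionally unwraps the phase).
    phi_p = np.angle(np.fft.fft2(pred_mask))
    phi_g = np.angle(np.fft.fft2(gt_mask))
    return np.mean(np.abs(phi_p - phi_g))

def soft_iou_loss(pred, gt, eps=1e-7):
    # Standard soft-IoU loss on probability maps in [0, 1].
    inter = np.sum(pred * gt)
    union = np.sum(pred) + np.sum(gt) - inter
    return 1.0 - inter / (union + eps)

def total_loss(pred, gt, lam_phase=1.0, lam_iou=1.0):
    # Weighted sum; the actual coefficients used in training are
    # not reproduced in this summary, so unit weights are assumed.
    return lam_phase * phase_loss(pred, gt) + lam_iou * soft_iou_loss(pred, gt)
```

For a perfect prediction both terms vanish, so `total_loss` approaches zero, which is the sanity check one would expect of a combined spectral-spatial objective.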
5. Training and Evaluation Procedures
Training employs the Adam optimizer with a cosine annealing learning-rate schedule over 150 epochs and a batch size of 4. Image pre-processing consists of resizing to 256×256, random flips, affine transformations, and multi-scale training. The implementation uses PyTorch 2.3.0 and CUDA 12.1 on 4× NVIDIA RTX 2080 Ti GPUs. Evaluation spans five datasets: BUSI (breast ultrasound), TDD (dental X-ray), Kvasir-SEG (colon polyps), GLaS (glands), and PROMISE-12 (prostate MRI), with external hold-out sets (UDIAT, Mendeley, CVC-ColonDB) for cross-dataset generalization.
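The cosine annealing schedule can be written in closed form as below. The initial and final learning rates are illustrative placeholders, since the summary does not reproduce the paper's values:

```python
import numpy as np

def cosine_lr(epoch, total_epochs=150, lr_max=1e-4, lr_min=0.0):
    # Cosine annealing from lr_max down to lr_min over total_epochs.
    # lr_max and lr_min are assumed placeholder values.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / total_epochs))
```

In a PyTorch training loop this corresponds to pairing `torch.optim.Adam` with `torch.optim.lr_scheduler.CosineAnnealingLR`.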
6. Quantitative Performance and Ablation
Phi-SegNet achieves the following test IoU and F1 scores:
| Dataset | IoU | F1 |
|---|---|---|
| BUSI | 84.54% | 91.98% |
| TDD | 85.37% | 93.34% |
| Kvasir-SEG | 84.96% | 92.24% |
| GLaS | 83.83% | 91.49% |
| PROMISE-12 | 83.62% | 91.50% |
The average relative improvement over the next best-performing model is IoU +1.54% ± 1.26%, F1 +0.98% ± 0.71%. Unified training across modalities yields merged IoU of 78.97% and F1 of 87.52%. Cross-dataset tests maintain high performance, e.g., IoU of 70.43% on unseen UDIAT data. Ablation analyses confirm that each module—phase loss, BFMF, RFA—contributes consistently to performance gains.
7. Limitations, Insights, and Future Directions
Spectral priors enhance segmentation by encoding geometry and positional cues invariant to contrast fluctuations; RFA blocks denoise high-frequency clutter while retaining structure; phase alignment loss enforces boundary accuracy in frequency space. The current implementation incurs an increased parameter count (≈60 M) and computational load (≈82 G FLOPs), and uses a fixed LPF cutoff, which may not be universally optimal. Future work may include lightweight model distillation, adaptive spectral filtering dependent on modality, and semi-supervised or domain-adaptive schemes for enhanced generalization. A plausible implication is that phase-aware supervision could generalize to other tasks requiring fine-grained spatial discrimination in the presence of ambiguous or noisy input (Ali et al., 22 Jan 2026).
In conclusion, Phi-SegNet demonstrates that explicit phase-aware feedback and spectral regularization substantially advance medical image segmentation, enabling robust, modality-agnostic localization even for fine boundaries and structurally complex targets.