FE-UNet: Frequency-Enhanced Image Segmentation
- The paper presents FE-UNet, a segmentation framework that enhances mid-frequency feature representation using wavelet-guided spectral pooling and frequency-enhanced receptive field blocks.
- It combines biologically inspired frequency sensitivity modeling with high-capacity pretrained backbones to balance local and global cues in complex imagery.
- Experimental results show mIoU improvements of up to 2.6 percentage points and mean Dice gains of up to 2.9 percentage points over prior methods on marine animal and polyp segmentation benchmarks.
FE-UNet is a segmentation framework that systematically enhances feature representations across the frequency domain, integrating biologically inspired frequency sensitivity modeling with high-capacity pretrained backbones to achieve state-of-the-art performance in versatile image segmentation contexts. Its core innovations are the Wavelet-Guided Spectral Pooling Module (WSPM) and the Frequency Domain Enhanced Receptive Field Block (FE-RFB), both of which aim to balance the encoding of local and global cues by selectively targeting mid-frequency features, thus mimicking the response of the human visual system (HVS). The architecture incorporates recent advances from the Segment Anything Model 2 (SAM2) and adopts Hiera-Large as a frozen backbone, yielding superior generalization and accuracy on complex segmentation benchmarks including marine animal and polyp datasets (Huo et al., 6 Feb 2025).
1. Motivation and Frequency Sensitivity Analysis
FE-UNet is motivated by empirical findings that conventional CNNs are biased toward high-frequency image components (edges, fine textures), while Transformer models favor low-frequency, global structures. However, natural and biomedical images exhibit diverse frequency profiles, and segmentation tasks frequently involve scenes dominated by low or mid frequencies, such as low-contrast underwater scenes or endoscopic images with subtle polyp borders. Human vision, as quantified by the contrast sensitivity function (CSF) of Mannos and Sakrison, is most sensitive to mid-frequency information. In its commonly cited form, with $f$ the spatial frequency in cycles per degree:
$$A(f) = 2.6\,(0.0192 + 0.114 f)\,\exp\!\left(-(0.114 f)^{1.1}\right)$$
A direct comparison of the CSF of a pretrained ResNet18 (measured via frequency masking on CIFAR-10) with that of the HVS shows that both respond weakly at very low frequencies but peak in different ranges: the HVS at mid frequencies, the CNN at mid-to-high frequencies. This observation underpins the development of modules that rebalance attention across frequencies, with a specific emphasis on the mid-frequency band to enhance segmentation robustness across imaging scenarios (Huo et al., 6 Feb 2025).
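To make the mid-frequency peak of the HVS concrete, the following minimal sketch evaluates the commonly cited Mannos-Sakrison CSF on a frequency grid and locates its maximum. The function name, the grid range, and the step size are illustrative choices, not taken from the paper.

```python
import math

def csf_mannos_sakrison(f):
    """Contrast sensitivity at spatial frequency f (cycles per degree),
    using the commonly cited Mannos-Sakrison form."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

# Scan 0..60 cycles/degree in steps of 0.1 (grid chosen for illustration).
freqs = [i * 0.1 for i in range(601)]
peak_f = max(freqs, key=csf_mannos_sakrison)

print(f"peak near {peak_f:.1f} cycles/degree")
print(f"sensitivity at 0.5 cpd: {csf_mannos_sakrison(0.5):.3f}")
print(f"sensitivity at peak:    {csf_mannos_sakrison(peak_f):.3f}")
print(f"sensitivity at 40 cpd:  {csf_mannos_sakrison(40.0):.3f}")
```

The curve is band-pass: sensitivity falls off on both sides of a peak at roughly 8 cycles per degree, which is the behavior FE-UNet tries to reproduce in its feature representations.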
2. Wavelet-Guided Spectral Pooling Module (WSPM)
WSPM is a compound module comprising two stages—Deep Wavelet Convolution (DWTConv) and a Spectral Pooling Filter (SPF)—designed to inject frequency-targeted transformations into UNet-like encoders.
- DWTConv applies Haar wavelet filters, decomposing input features into one low-frequency subband (LL) and three high-frequency subbands (LH, HL, HH). The low-frequency channels are enhanced and the subband responses are re-aggregated through inverse wavelet transforms across cascaded levels.
- The Spectral Pooling Filter transforms the spatial features into the Fourier domain, splits them into low- and high-frequency parts with a radial mask, and selectively mixes the two parts with a learnable ratio.
This two-stage mixing enhances low-frequency representations and maintains feature salience in the mid-frequency band, rebalancing the otherwise high-frequency dominant activations typical of vanilla CNNs.
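The two stages can be sketched on a single 2D feature map with NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names, the orthonormal Haar scaling, the centred circular mask, and the convex mix `alpha * low + (1 - alpha) * high` are all assumptions, and in the real module the ratio would be a learned parameter and the subbands would pass through convolutions.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform of an (H, W) array with even H and W.
    Subband naming (LH vs. HL) varies between libraries."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-pass in both directions
    lh = (a - b + c - d) / 2.0   # high-pass along width
    hl = (a + b - c - d) / 2.0   # high-pass along height
    hh = (a - b - c + d) / 2.0   # high-pass in both directions
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: re-aggregate the four subbands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def spectral_mix(x, radius, alpha):
    """Split x into low/high bands with a centred circular FFT mask,
    then remix them. alpha stands in for the learnable ratio."""
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))        # move DC to the centre
    yy, xx = np.ogrid[:h, :w]
    low_mask = np.hypot(yy - h // 2, xx - w // 2) <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spec * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spec * ~low_mask)).real
    return alpha * low + (1.0 - alpha) * high

x = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(x)
restored = haar_idwt2(ll, lh, hl, hh)    # equals x up to rounding
mixed = spectral_mix(x, radius=1.0, alpha=0.7)  # alpha > 0.5 favours low band
```

With `alpha = 0.5` the mix reduces to a uniform rescaling of the input, so values above 0.5 shift energy toward the low band and values below 0.5 toward the high band.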
3. Frequency Domain Enhanced Receptive Field Block (FE-RFB)
FE-RFB extends the frequency-aware processing of WSPM to multi-scale context aggregation. It employs parallel branches with distinct kernel shapes and dilation rates, simulating HVS-like eccentricity and increasing the diversity of receptive fields:
- Each branch: applies its own kernel shape and dilation rate to the incoming features.
- Output: the branch outputs are concatenated and fused by a convolution, integrating multi-scale, frequency-enhanced features.
This configuration enables FE-UNet to efficiently integrate both global context and local details, analogous to the human retina's band-pass response structure.
4. Architecture and Integration of Pretrained Backbones
FE-UNet is built around a U-Net architecture; its encoder employs a frozen Hiera-Large model from the SAM2 segmentation paradigm. At each encoder hierarchy level:
- An adapter (MLP: Linear→GeLU→Linear→GeLU) precedes Hiera, supporting parameter-efficient tuning.
- Each encoder feature map is passed through a depthwise convolution to reduce its channel dimension, followed by FE-RFB.
- The decoding pathway consists of standard UNet upsampling and convolutional blocks ("DoubleConv"), with deep supervision via 1×1 convolutions at multiple scales.
The entire architecture enhances mid-frequency signal representation at all U-shape levels.
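The adapter's Linear→GeLU→Linear→GeLU layout can be sketched in NumPy as follows. The class name, the bottleneck width, the initialization, and the absence of a residual connection are assumptions made for illustration; only the layer order comes from the description above.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class Adapter:
    """Linear -> GELU -> Linear -> GELU applied to token features.
    `hidden` is a hypothetical bottleneck width."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, (hidden, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, tokens):
        h = gelu(tokens @ self.w1 + self.b1)     # project down
        return gelu(h @ self.w2 + self.b2)       # project back to dim

    def num_params(self):
        return sum(p.size for p in (self.w1, self.b1, self.w2, self.b2))

adapter = Adapter(dim=64, hidden=16)
tokens = np.random.default_rng(1).normal(size=(10, 64))  # 10 tokens, 64 channels
out = adapter(tokens)
print(out.shape, adapter.num_params())
```

Because the backbone is frozen, only small modules of this kind (plus the decoder and frequency blocks) would receive gradient updates, which is what keeps the tuning parameter-efficient.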
5. Training Procedures and Loss Functions
Training employs a composite loss over three side outputs, each compared to the ground truth using a weighted IoU term and a weighted binary cross-entropy term. The total loss sums the per-output losses over all three outputs. No additional frequency-domain regularizers are used beyond WSPM's intrinsic spectral mixing.
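A minimal NumPy sketch of such a loss is shown below. It assumes the two terms are simply added per side output, that predictions are already probabilities, and that the per-pixel weight map is supplied by the caller (uniform by default); the function names and the smoothing constant are illustrative, and the paper's exact weighting scheme is not reproduced here.

```python
import numpy as np

def weighted_bce(prob, target, weight, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged with the given weights."""
    prob = np.clip(prob, eps, 1.0 - eps)
    bce = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    return float((weight * bce).sum() / weight.sum())

def weighted_iou(prob, target, weight, smooth=1.0):
    """One minus a soft, weighted intersection-over-union."""
    inter = (weight * prob * target).sum()
    union = (weight * (prob + target)).sum() - inter
    return float(1.0 - (inter + smooth) / (union + smooth))

def side_output_loss(prob, target, weight=None):
    """Assumed combination: weighted IoU term plus weighted BCE term."""
    if weight is None:
        weight = np.ones_like(target, dtype=float)  # uniform weights
    return weighted_iou(prob, target, weight) + weighted_bce(prob, target, weight)

def total_loss(side_probs, target, weight=None):
    """Sum of the per-output losses over all side outputs."""
    return sum(side_output_loss(p, target, weight) for p in side_probs)

target = np.array([[0.0, 1.0], [1.0, 1.0]])
perfect = [target.copy() for _ in range(3)]
uncertain = [np.full_like(target, 0.5) for _ in range(3)]
print(total_loss(perfect, target), total_loss(uncertain, target))
```

A perfect prediction drives both terms to approximately zero, while a uniformly uncertain prediction is penalized by both the region-level IoU term and the pixel-level cross-entropy term.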
6. Experimental Results and State-of-the-Art Benchmarks
FE-UNet demonstrates superior performance across challenging segmentation benchmarks:
- Marine animal segmentation
  - MAS3K: mIoU = 0.815 (vs. Dual-SAM 0.789; +2.6 points)
  - RUWI: mIoU = 0.914 (vs. Dual-SAM 0.904; +1.0 points)
- Polyp segmentation
  - Kvasir: mean Dice = 0.929 (vs. CFA-Net 0.915; +1.4 points)
  - CVC-ColonDB: mean Dice = 0.804 (vs. CaraNet 0.775; +2.9 points)
Ablations indicate that omitting WSPM reduces performance by 3–5% mIoU, and that placement of FE-RFB at all U-shape levels is optimal, primarily affecting levels 2–3.
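For reference, the two reported metrics and the size of the reported gains can be reproduced with a few lines of plain Python. The helper names and the toy masks are illustrative; the benchmark scores are the ones listed above, and the gains are absolute differences in the metric.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two flat binary masks (sequences of 0/1)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    if tp + fp + fn == 0:          # both masks empty: treat as perfect match
        return 1.0, 1.0
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice

def gain(new, baseline):
    """Absolute gain in percentage points and relative gain in percent."""
    return (new - baseline) * 100.0, (new - baseline) / baseline * 100.0

iou, dice = iou_and_dice([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(iou, dice)                       # Dice equals 2*IoU / (1 + IoU)

abs_pts, rel_pct = gain(0.815, 0.789)  # MAS3K mIoU vs. Dual-SAM
print(f"{abs_pts:.1f} points absolute, {rel_pct:.1f}% relative")
```

The MAS3K improvement of 2.6 points corresponds to a relative gain of about 3.3% over the baseline score.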
7. Strengths, Limitations, and Prospective Extensions
FE-UNet's design ensures frequency-aware feature learning, which is particularly robust under low-contrast or noisy imaging conditions. Its mimicry of HVS mid-frequency sensitivity and integration with large pre-trained backbones allow it to generalize effectively with efficient parameter usage. Limitations include increased model complexity and minor computational overhead due to wavelet and DFT operations, and the need for tuning the mixing parameter in SPF per dataset. Potential future directions include substitution of alternative or learnable wavelet bases, dynamic (potentially learnable) frequency mixing strategies, and application to frequency-challenging domains such as remote sensing and histopathology. Further extensions could explore spectral-domain self-attention for enhanced long-range context modeling (Huo et al., 6 Feb 2025).