SAGE-UNet: Adaptive Expert Segmentation Model
- The model introduces shape-adapting gated experts that dynamically select between CNN and Transformer modules for input-specific processing.
- It employs a dual-path fusion strategy with learned gating to balance backbone representations and specialized expert outputs.
- State-of-the-art performance is demonstrated with Dice scores above 94% on benchmarks like EBHI, DigestPath, and GlaS.
The SAGE-UNet architecture is a dynamically routed, dual-path encoder-decoder segmentation model that introduces Shape-Adapting Gated Experts (SAGE) for input-adaptive computation in heterogeneous visual networks. SAGE-UNet is designed to address the challenges of cellular heterogeneity in medical imaging, particularly for colonoscopic lesion segmentation, by adaptively selecting among a pool of heterogeneous experts (CNNs and Transformers) at every encoder level. Key innovations include a hierarchical gating and selection mechanism, a dual-path fusion strategy, and a Shape-Adapting Hub (SA-Hub) for seamless feature translation between diverse expert modules. The framework achieves state-of-the-art segmentation performance on EBHI, DigestPath, and GlaS medical benchmarks, with Dice scores of 95.57%, 95.16%, and 94.17%, respectively, highlighting its efficacy in robust domain generalization and flexible allocation of computation (Thai et al., 23 Nov 2025).
1. Dual-Path Expert-Backbone Fusion
At the core of SAGE-UNet is the replacement of each static encoder block with a two-path module:
- Main path (backbone stream): At each encoder layer $\ell$, forward propagation through the pretrained backbone is preserved as $F_\ell^{\mathrm{bb}} = B_\ell(F_{\ell-1})$.
- Expert path: The same input is dynamically routed through a selected subset of expert modules—comprising $4$ shared and $16$ fine-grained experts—based on hierarchical gating, producing an enriched feature $F_\ell^{\mathrm{exp}}$.
- Dual-path fusion: The layer output is a convex combination
$$F_\ell = (1 - g_\ell)\,F_\ell^{\mathrm{bb}} + g_\ell\,F_\ell^{\mathrm{exp}},$$
where $g_\ell \in [0, 1]$ is a learned gate. $g_\ell \to 0$ defaults to the backbone, while $g_\ell \to 1$ amplifies expert influence. This mechanism enables SAGE-UNet to fall back to the pretrained backbone in regions requiring standard representations and to invoke experts for fine-grained or globally ambiguous regions.
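As a minimal numerical sketch of the convex combination above (using NumPy, with a hand-set gate logit standing in for the learned gate—illustrative, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_path_fusion(f_backbone, f_expert, gate_logit):
    """Convex combination of backbone and expert features.

    g -> 0 falls back to the backbone path; g -> 1 amplifies the expert
    path. `gate_logit` is a stand-in for a learned scalar per layer.
    """
    g = sigmoid(gate_logit)
    return (1.0 - g) * f_backbone + g * f_expert

f_bb = np.ones((4, 8, 8))       # toy backbone feature map (C, H, W)
f_ex = np.full((4, 8, 8), 3.0)  # toy expert feature map

# A very negative logit collapses the output onto the backbone path.
out = dual_path_fusion(f_bb, f_ex, gate_logit=-10.0)
```

With the logit at $-10$, the gate is effectively zero and the fused output matches the backbone features, illustrating the fallback behavior described above.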
2. Hierarchical Dynamic Expert Routing
SAGE-UNet employs a two-level, input-adaptive expert selection algorithm:
- High-level gating: A lightweight gate computes $\alpha_\ell = \sigma(w_g^\top z_\ell)$, with $z_\ell$ being the global average pooled input. $\alpha_\ell$ biases the expert selection toward shared ($\alpha_\ell \to 1$) or fine-grained ($\alpha_\ell \to 0$) experts.
- Semantic Affinity Routing (SAR): Computes expert logits via scaled dot-product attention with additive input-dependent noise to promote diversity:
$$r_i = \frac{q_\ell^\top k_i}{\sqrt{d}} + \epsilon_i,$$
where $q_\ell$ is a query derived from the input, $k_i$ is a learned key for expert $i$, and $\epsilon_i$ is input-dependent noise.
- Logit modulation: The logits are shifted using $\alpha_\ell$ and a binary mask $m$ marking membership in the shared expert pool:
$$\tilde{r}_i = r_i + \alpha_\ell\, m_i.$$
- Top-K selection: The top $K$ experts (per layer) are selected as the index set $\mathcal{T}_\ell$ of the largest entries of $\tilde{r}$, and their outputs are weighted and combined:
$$F_\ell^{\mathrm{exp}} = \sum_{i \in \mathcal{T}_\ell} w_i\, \hat{E}_i(F_{\ell-1}),$$
where $w = \mathrm{softmax}\big(\{\tilde{r}_i\}_{i \in \mathcal{T}_\ell}\big)$. This routing enables the model to adaptively select experts specialized for the current input's structure and semantics (Thai et al., 23 Nov 2025).
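The two-level routing can be sketched as follows. This is an illustrative NumPy toy, not the paper's exact formulation: the noise model, the additive $\alpha$-shift, and all names (`route_experts`, `shared_mask`) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def route_experts(query, keys, shared_mask, alpha, k=4, noise_scale=0.0):
    """Sketch of Semantic Affinity Routing with high-level gating.

    Scaled dot-product logits, optional noise for diversity, an additive
    shift of `alpha` toward shared experts, then Top-K selection with
    softmax weights over the selected logits.
    """
    d = keys.shape[1]
    logits = keys @ query / np.sqrt(d)                 # affinity per expert
    logits += noise_scale * rng.standard_normal(logits.shape)
    logits += alpha * shared_mask                      # alpha -> 1 favors shared pool
    top = np.argsort(logits)[-k:]                      # indices of k largest logits
    w = np.exp(logits[top] - logits[top].max())        # stable softmax
    w /= w.sum()
    return top, w

n_experts, d = 20, 16
keys = rng.standard_normal((n_experts, d))             # learned expert keys (toy)
query = rng.standard_normal(d)                         # input-derived query (toy)
shared_mask = np.array([1.0] * 4 + [0.0] * 16)         # 4 shared, 16 fine-grained

idx, w = route_experts(query, keys, shared_mask, alpha=0.9, k=4)
```

The selected indices pick out $K=4$ experts and the weights form a proper convex combination, mirroring the Top-K-plus-softmax scheme described above.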
3. Shape-Adapting Hub for Heterogeneous Expert Integration
The SA-Hub facilitates translation between feature representations expected by diverse experts (CNN and Transformer):
- Input adapter $A_i^{\mathrm{in}}$: Transforms the backbone feature $F_{\ell-1}$ into the expert-specific input space through reshaping, patchifying, or projection: $x_i = A_i^{\mathrm{in}}(F_{\ell-1})$.
- Expert execution: Expert $E_i$ computes its output $y_i = E_i(x_i)$.
- Output adapter $A_i^{\mathrm{out}}$: Projects the expert output back to the backbone-compatible space: $\hat{E}_i(F_{\ell-1}) = A_i^{\mathrm{out}}(y_i)$.
- Expert path fusion: The overall expert feature is the weighted sum of the selected experts:
$$F_\ell^{\mathrm{exp}} = \sum_{i \in \mathcal{T}_\ell} w_i\, \hat{E}_i(F_{\ell-1}).$$
This approach ensures compatibility among experts with disparate architectures and input-output formats, removing the need for excessive manual tuning when incorporating heterogeneous modules (Thai et al., 23 Nov 2025).
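A minimal sketch of the adapter pattern, assuming a Transformer-style expert that consumes token sequences while the backbone works on $(C, H, W)$ feature maps (class and function names here are illustrative, not the paper's API):

```python
import numpy as np

class AdaptedExpert:
    """SA-Hub sketch: wrap a heterogeneous expert with input/output
    adapters so that every expert maps backbone features (C, H, W)
    back to backbone-compatible features (C, H, W)."""

    def __init__(self, expert, to_expert, to_backbone):
        self.expert = expert
        self.to_expert = to_expert        # A_in: backbone space -> expert space
        self.to_backbone = to_backbone    # A_out: expert space -> backbone space

    def __call__(self, f):
        return self.to_backbone(self.expert(self.to_expert(f)))

# A toy "Transformer-style" expert operating on (num_tokens, C) sequences:
def token_expert(tokens):
    return tokens + tokens.mean(axis=0, keepdims=True)  # toy global mixing

C, H, W = 8, 4, 4
patchify = lambda f: f.reshape(C, H * W).T     # (C, H, W) -> (HW, C) tokens
unpatchify = lambda t: t.T.reshape(C, H, W)    # tokens -> (C, H, W)

expert = AdaptedExpert(token_expert, patchify, unpatchify)
f = np.random.default_rng(1).standard_normal((C, H, W))
out = expert(f)
```

Because the adapters bracket the expert, a CNN expert could be dropped into the same wrapper with identity adapters, which is the interchangeability the SA-Hub is designed to provide.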
4. Architectural Integration within the UNet Framework
SAGE-UNet maintains the canonical U-Net encoder-decoder structure, with specific modifications to the encoder:
- Stem: The input $x$ is processed via an initial stem to obtain $F_0 = \mathrm{Stem}(x)$.
- Encoder: For encoder depths $\ell = 1, \dots, L$, each block implements the dual-path SAGE module, collecting features $F_\ell$ at each scale.
- Skip-connections: Multiscale encoder outputs are forwarded to the decoder for spatially-resolved fusion.
- Decoder: The decoder utilizes standard U-Net upsampling and concatenation operations, fusing skip-connected features for refined spatial localization.
- Segmentation head: Pixel-wise prediction is performed by the final head on the decoder output.
Within any encoder stage, the selected experts can comprise CNN or Transformer architectures depending on the input, dynamically balancing local and global feature extraction. This design enables SAGE-UNet to flexibly adapt capacity allocation and computational routing according to input complexity (Thai et al., 23 Nov 2025).
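The overall encoder loop can be sketched as below. The backbone block, expert path, and gate are all stubs (average pooling and constants) chosen only to show the control flow and the skip-feature bookkeeping, not the real modules:

```python
import numpy as np

def sage_encoder(x, depth=4):
    """Sketch of the SAGE-UNet encoder loop: each stage fuses a backbone
    block with a (stubbed) expert path and records a skip feature for
    the decoder. Blocks are toy 2x average-pooling stand-ins."""
    def block(f):                        # stand-in for backbone stage B_l
        c, h, w = f.shape
        return f.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

    skips = []
    f = x
    for _ in range(depth):
        f_bb = block(f)                  # main (backbone) path
        f_exp = f_bb * 1.1               # stub for the routed expert output
        g = 0.3                          # stub for the learned fusion gate
        f = (1 - g) * f_bb + g * f_exp   # dual-path convex combination
        skips.append(f)                  # skip-connection to the decoder
    return f, skips

x = np.zeros((8, 64, 64))
bottom, skips = sage_encoder(x)
```

Each iteration halves the spatial resolution and appends one multiscale skip feature, matching the canonical U-Net encoder contract that the decoder then consumes via upsampling and concatenation.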
5. Hyperparameter and Configuration Summary
The main configuration parameters are as follows:
| Parameter | Value/Description |
|---|---|
| Total experts | 20 |
| Shared experts | 4 |
| Fine-grained experts | 16 |
| Top-K per layer | 4 |
| Channel dims | 96, 192, 384, 768 (as in ConvNeXt/ViT) |
| Query/key dim | 64 or 128 |
| Expert type (heterogeneous) | CNN and Transformer |
Gating between paths and among experts is implemented using soft sigmoid gates and Top-K thresholding. These design details are selected to optimize segmentation efficiency and adaptivity across scales and visual complexities (Thai et al., 23 Nov 2025).
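For reference, the table above can be expressed as a plain configuration dictionary (key names are illustrative, not the authors' config schema):

```python
# Configuration summary from the table above; the shared and
# fine-grained pools together make up the total expert count.
sage_config = {
    "num_experts_total": 20,
    "num_shared_experts": 4,
    "num_fine_grained_experts": 16,
    "top_k_per_layer": 4,
    "channel_dims": [96, 192, 384, 768],  # ConvNeXt/ViT-style stage widths
    "qk_dim": 64,                         # 64 or 128 in the paper
    "expert_types": ["cnn", "transformer"],
}
```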
6. Adaptivity for Local-Global Feature Balancing
SAGE-UNet is designed to dynamically allocate focus based on spatial and semantic complexity:
- In early, shallow layers associated with local pattern extraction (edges, textures), the high-level gate $\alpha_\ell$ is learned to be large, biasing selection toward shared CNN experts.
- In deeper layers (typically Transformer-based), $\alpha_\ell$ approaches 0.5, promoting a blend of shared/global and fine-grained/context-aware experts.
- The semantic affinity routing logits, modulated by $\alpha_\ell$, ensure appropriate expert selection for each spatial context. Dual-path fusion via the gate $g_\ell$ enables the model to interpolate between backbone-like and expert-driven representations at each scale.
Segmentation of simple image regions proceeds through the main backbone, whereas complex or ambiguous regions invoke additional computation via experts tailored to either local or global content. This adaptivity underpins the model’s robust generalization to diverse histopathology benchmarks (Thai et al., 23 Nov 2025).