SAGE-UNet: Adaptive Expert Segmentation Model
- The model introduces shape-adapting gated experts that dynamically select between CNN and Transformer modules for input-specific processing.
- It employs a dual-path fusion strategy with learned gating to balance backbone representations and specialized expert outputs.
- State-of-the-art performance is demonstrated with Dice scores above 94% on benchmarks like EBHI, DigestPath, and GlaS.
The SAGE-UNet architecture is a dynamically routed, dual-path encoder-decoder segmentation model that introduces Shape-Adapting Gated Experts (SAGE) for input-adaptive computation in heterogeneous visual networks. SAGE-UNet is designed to address the challenges of cellular heterogeneity in medical imaging, particularly for colonoscopic lesion segmentation, by adaptively selecting among a pool of heterogeneous experts (CNNs and Transformers) at every encoder level. Key innovations include a hierarchical gating and selection mechanism, a dual-path fusion strategy, and a Shape-Adapting Hub (SA-Hub) for seamless feature translation between diverse expert modules. The framework achieves state-of-the-art segmentation performance on EBHI, DigestPath, and GlaS medical benchmarks, with Dice scores of 95.57%, 95.16%, and 94.17%, respectively, highlighting its efficacy in robust domain generalization and flexible allocation of computation (Thai et al., 23 Nov 2025).
1. Dual-Path Expert-Backbone Fusion
At the core of SAGE-UNet is the replacement of each static encoder block with a two-path module:
- Main path (backbone stream): At each encoder layer $\ell$, forward propagation through the pretrained backbone is preserved as $F_\ell^{\mathrm{bb}} = B_\ell(F_{\ell-1})$.
- Expert path: The same input is dynamically routed through a selected subset of expert modules—comprising $4$ shared and $16$ fine-grained experts—based on hierarchical gating, producing an enriched feature $F_\ell^{\mathrm{exp}}$.
- Dual-path fusion: The layer output is a convex combination
$$F_\ell = (1 - g_\ell)\,F_\ell^{\mathrm{bb}} + g_\ell\,F_\ell^{\mathrm{exp}},$$
where $g_\ell \in [0, 1]$ is a learned gate. $g_\ell \to 0$ defaults to the backbone, while $g_\ell \to 1$ amplifies expert influence. This mechanism enables SAGE-UNet to fall back to the pretrained backbone in regions requiring standard representations and to invoke experts for fine-grained or globally ambiguous regions.
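As a minimal numerical sketch of the convex combination above (using NumPy, with a hand-set gate logit standing in for the learned gate—illustrative, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_path_fusion(f_backbone, f_expert, gate_logit):
    """Convex combination of backbone and expert features.

    g -> 0 falls back to the backbone path; g -> 1 amplifies the expert
    path. `gate_logit` is a stand-in for a learned scalar per layer.
    """
    g = sigmoid(gate_logit)
    return (1.0 - g) * f_backbone + g * f_expert

f_bb = np.ones((4, 8, 8))       # toy backbone feature map (C, H, W)
f_ex = np.full((4, 8, 8), 3.0)  # toy expert feature map

# A very negative logit collapses the output onto the backbone path.
out = dual_path_fusion(f_bb, f_ex, gate_logit=-10.0)
```

With the logit at $-10$, the gate is effectively zero and the fused output matches the backbone features, illustrating the fallback behavior described above.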
2. Hierarchical Dynamic Expert Routing
SAGE-UNet employs a two-level, input-adaptive expert selection algorithm:
- High-level gating: A lightweight gate computes $\alpha_\ell = \sigma(w_g^\top z_\ell)$, with $z_\ell$ being the global average pooled input. $\alpha_\ell$ biases the expert selection toward shared ($\alpha_\ell \to 1$) or fine-grained ($\alpha_\ell \to 0$) experts.
- Semantic Affinity Routing (SAR): Computes expert logits via scaled dot-product attention with additive input-dependent noise to promote diversity:
$$r_i = \frac{q_\ell^\top k_i}{\sqrt{d}} + \epsilon_i,$$
where $q_\ell$ is a query derived from the input, $k_i$ is a learned key for expert $i$, and $\epsilon_i$ is input-dependent noise.
- Logit modulation: The logits are shifted using $\alpha_\ell$ and a binary mask $m$ marking membership in the shared expert pool:
$$\tilde{r}_i = r_i + \alpha_\ell\, m_i.$$
- Top-K selection: The top $K$ experts (per layer) are selected as the index set $\mathcal{T}_\ell$ of the largest entries of $\tilde{r}$, and their outputs are weighted and combined:
$$F_\ell^{\mathrm{exp}} = \sum_{i \in \mathcal{T}_\ell} w_i\, \hat{E}_i(F_{\ell-1}),$$
where $w = \mathrm{softmax}\big(\{\tilde{r}_i\}_{i \in \mathcal{T}_\ell}\big)$. This routing enables the model to adaptively select experts specialized for the current input's structure and semantics (Thai et al., 23 Nov 2025).
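The two-level routing can be sketched as follows. This is an illustrative NumPy toy, not the paper's exact formulation: the noise model, the additive $\alpha$-shift, and all names (`route_experts`, `shared_mask`) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def route_experts(query, keys, shared_mask, alpha, k=4, noise_scale=0.0):
    """Sketch of Semantic Affinity Routing with high-level gating.

    Scaled dot-product logits, optional noise for diversity, an additive
    shift of `alpha` toward shared experts, then Top-K selection with
    softmax weights over the selected logits.
    """
    d = keys.shape[1]
    logits = keys @ query / np.sqrt(d)                 # affinity per expert
    logits += noise_scale * rng.standard_normal(logits.shape)
    logits += alpha * shared_mask                      # alpha -> 1 favors shared pool
    top = np.argsort(logits)[-k:]                      # indices of k largest logits
    w = np.exp(logits[top] - logits[top].max())        # stable softmax
    w /= w.sum()
    return top, w

n_experts, d = 20, 16
keys = rng.standard_normal((n_experts, d))             # learned expert keys (toy)
query = rng.standard_normal(d)                         # input-derived query (toy)
shared_mask = np.array([1.0] * 4 + [0.0] * 16)         # 4 shared, 16 fine-grained

idx, w = route_experts(query, keys, shared_mask, alpha=0.9, k=4)
```

The selected indices pick out $K=4$ experts and the weights form a proper convex combination, mirroring the Top-K-plus-softmax scheme described above.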
3. Shape-Adapting Hub for Heterogeneous Expert Integration
The SA-Hub facilitates translation between feature representations expected by diverse experts (CNN and Transformer):
- Input adapter $A_i^{\mathrm{in}}$: Transforms the backbone feature $F_{\ell-1}$ into the expert-specific input space through reshaping, patchifying, or projection: $x_i = A_i^{\mathrm{in}}(F_{\ell-1})$.
- Expert execution: Expert $E_i$ computes its output $y_i = E_i(x_i)$.
- Output adapter $A_i^{\mathrm{out}}$: Projects the expert output back to the backbone-compatible space: $\hat{E}_i(F_{\ell-1}) = A_i^{\mathrm{out}}(y_i)$.
- Expert path fusion: The overall expert feature is the weighted sum of the selected experts:
$$F_\ell^{\mathrm{exp}} = \sum_{i \in \mathcal{T}_\ell} w_i\, \hat{E}_i(F_{\ell-1}).$$
This approach ensures compatibility among experts with disparate architectures and input-output formats, removing the need for excessive manual tuning when incorporating heterogeneous modules (Thai et al., 23 Nov 2025).
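A minimal sketch of the adapter pattern, assuming a Transformer-style expert that consumes token sequences while the backbone works on $(C, H, W)$ feature maps (class and function names here are illustrative, not the paper's API):

```python
import numpy as np

class AdaptedExpert:
    """SA-Hub sketch: wrap a heterogeneous expert with input/output
    adapters so that every expert maps backbone features (C, H, W)
    back to backbone-compatible features (C, H, W)."""

    def __init__(self, expert, to_expert, to_backbone):
        self.expert = expert
        self.to_expert = to_expert        # A_in: backbone space -> expert space
        self.to_backbone = to_backbone    # A_out: expert space -> backbone space

    def __call__(self, f):
        return self.to_backbone(self.expert(self.to_expert(f)))

# A toy "Transformer-style" expert operating on (num_tokens, C) sequences:
def token_expert(tokens):
    return tokens + tokens.mean(axis=0, keepdims=True)  # toy global mixing

C, H, W = 8, 4, 4
patchify = lambda f: f.reshape(C, H * W).T     # (C, H, W) -> (HW, C) tokens
unpatchify = lambda t: t.T.reshape(C, H, W)    # tokens -> (C, H, W)

expert = AdaptedExpert(token_expert, patchify, unpatchify)
f = np.random.default_rng(1).standard_normal((C, H, W))
out = expert(f)
```

Because the adapters bracket the expert, a CNN expert could be dropped into the same wrapper with identity adapters, which is the interchangeability the SA-Hub is designed to provide.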
4. Architectural Integration within the UNet Framework
SAGE-UNet maintains the canonical U-Net encoder-decoder structure, with specific modifications to the encoder:
- Stem: The input $x$ is processed via an initial stem to obtain $F_0 = \mathrm{Stem}(x)$.
- Encoder: For encoder depths $\ell = 1, \dots, L$, each block implements the dual-path SAGE module, collecting features $F_\ell$ at each scale.
- Skip-connections: Multiscale encoder outputs are forwarded to the decoder for spatially-resolved fusion.
- Decoder: The decoder utilizes standard U-Net upsampling and concatenation operations, fusing skip-connected features for refined spatial localization.
- Segmentation head: Pixel-wise prediction is performed by the final head on the decoder output.
Within any encoder stage, the selected experts can comprise CNN or Transformer architectures depending on the input, dynamically balancing local and global feature extraction. This design enables SAGE-UNet to flexibly adapt capacity allocation and computational routing according to input complexity (Thai et al., 23 Nov 2025).
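The overall encoder loop can be sketched as below. The backbone block, expert path, and gate are all stubs (average pooling and constants) chosen only to show the control flow and the skip-feature bookkeeping, not the real modules:

```python
import numpy as np

def sage_encoder(x, depth=4):
    """Sketch of the SAGE-UNet encoder loop: each stage fuses a backbone
    block with a (stubbed) expert path and records a skip feature for
    the decoder. Blocks are toy 2x average-pooling stand-ins."""
    def block(f):                        # stand-in for backbone stage B_l
        c, h, w = f.shape
        return f.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

    skips = []
    f = x
    for _ in range(depth):
        f_bb = block(f)                  # main (backbone) path
        f_exp = f_bb * 1.1               # stub for the routed expert output
        g = 0.3                          # stub for the learned fusion gate
        f = (1 - g) * f_bb + g * f_exp   # dual-path convex combination
        skips.append(f)                  # skip-connection to the decoder
    return f, skips

x = np.zeros((8, 64, 64))
bottom, skips = sage_encoder(x)
```

Each iteration halves the spatial resolution and appends one multiscale skip feature, matching the canonical U-Net encoder contract that the decoder then consumes via upsampling and concatenation.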
5. Hyperparameter and Configuration Summary
The main configuration parameters are as follows:
| Parameter | Value/Description |
|---|---|
| Total experts | 20 |
| Shared experts | 4 |
| Fine-grained experts | 16 |
| Top-K per layer | 4 |
| Channel dims | 96, 192, 384, 768 (as in ConvNeXt/ViT) |
| Query/key dim | 64 or 128 |
| Expert type (heterogeneous) | CNN and Transformer |
Gating between paths and among experts is implemented using soft sigmoid gates and Top-K thresholding. These design details are selected to optimize segmentation efficiency and adaptivity across scales and visual complexities (Thai et al., 23 Nov 2025).
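For reference, the table above can be expressed as a plain configuration dictionary (key names are illustrative, not the authors' config schema):

```python
# Configuration summary from the table above; the shared and
# fine-grained pools together make up the total expert count.
sage_config = {
    "num_experts_total": 20,
    "num_shared_experts": 4,
    "num_fine_grained_experts": 16,
    "top_k_per_layer": 4,
    "channel_dims": [96, 192, 384, 768],  # ConvNeXt/ViT-style stage widths
    "qk_dim": 64,                         # 64 or 128 in the paper
    "expert_types": ["cnn", "transformer"],
}
```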
6. Adaptivity for Local-Global Feature Balancing
SAGE-UNet is designed to dynamically allocate focus based on spatial and semantic complexity:
- In early, shallow layers associated with local pattern extraction (edges, textures), the high-level gate $\alpha_\ell$ is learned to be large, biasing selection toward shared CNN experts.
- In deeper layers (typically Transformer-based), $\alpha_\ell$ approaches 0.5, promoting a blend of shared/global and fine-grained/context-aware experts.
- The semantic affinity routing logits, modulated by $\alpha_\ell$, ensure appropriate expert selection for each spatial context. Dual-path fusion via the gate $g_\ell$ enables the model to interpolate between backbone-like and expert-driven representations at each scale.
Segmentation of simple image regions proceeds through the main backbone, whereas complex or ambiguous regions invoke additional computation via experts tailored to either local or global content. This adaptivity underpins the model’s robust generalization to diverse histopathology benchmarks (Thai et al., 23 Nov 2025).