Coarse Guidance Network (CGN)
- A Coarse Guidance Network (CGN) is a module that injects coarse spatial context into high-resolution patch features to enhance slide-level predictions in MIL frameworks.
- It remaps instance features to a coarse grid using field-of-view driven binning and processes them through a lightweight convolutional head to compute a guidance map.
- Empirical evaluations show that incorporating CGNs improves biomarker classification AUCs while maintaining low parameter and computational overhead.
A Coarse Guidance Network (CGN) is a module designed to learn and inject spatial contextual information at a coarser scale into high-magnification instance features within Multiple-Instance Learning (MIL) frameworks for whole-slide image (WSI) analysis. The CGN operates via grid-based remapping of instance features and a lightweight convolutional head to produce a coarse guidance map, which is then used to modulate the instance features before final attention-based aggregation. This approach enables progressive multi-scale context modeling in computational pathology tasks, offering a parameter-efficient mechanism for slide-level prediction enhancement while maintaining computational tractability (Wu et al., 2 Feb 2026).
1. Architectural Overview
The CGN processes high-magnification patch features H ∈ ℝ^{N×D} and their normalized spatial coordinates coords ∈ [0, 1]^{N×2}. Its core workflow comprises three sequential steps:
- Grid-based Remapping: High-magnification features are aggregated into a 3D coarse feature map M ∈ ℝ^{D×H′×W′} based on spatial bin assignments determined by a selectable field-of-view (FOV) parameter; each instance also receives a flat bin index, idx ∈ ℕ^N.
- Convolutional Guidance Head: M is processed by two 3×3 convolutions with ReLU activations and a 1×1 convolution with Sigmoid activation to yield the coarse guidance map P ∈ ℝ^{1×H′×W′}.
- Patch-level Gating: P is flattened and indexed by idx to obtain M_A ∈ ℝ^N, which gates each corresponding row of H, resulting in modulated features H_k = H ⊙ M_A.
The diagrammatic ASCII representation is:
H (N×D), coords (N×2)
│
┌───────┴───────────────────────────┐
│ Grid-based remapping │
│ → M ∈ ℝ^{D×H′×W′}, idx∈ℕ^N │
└───────┬───────────────────────────┘
│
┌───────┴─────────┐
│ Conv3×3(D→D′) │
│ → ReLU │
│ Conv3×3(D′→D′) │
│ → ReLU │
│ Conv1×1(D′→1) │
│ → Sigmoid │
└───────┬─────────┘
│
P ∈ ℝ^{1×H′×W′}
│
flatten & index by idx
│
M_A ∈ ℝ^N
│
H_k = H ⊙ M_A (N×D)
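The workflow above can be condensed into a single module. The following is a minimal PyTorch sketch; the class name, the dense scatter-mean implementation, and the grid-size handling are illustrative assumptions, not the authors' reference code:

```python
import torch
import torch.nn as nn


class CoarseGuidanceNetwork(nn.Module):
    """Sketch of one CGN block: grid remapping -> conv head -> patch gating."""

    def __init__(self, in_dim, hidden_dim, grid_hw):
        super().__init__()
        self.grid_hw = grid_hw  # (H', W') coarse grid size for this FOV
        self.head = nn.Sequential(
            nn.Conv2d(in_dim, hidden_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden_dim, 1, 1), nn.Sigmoid(),
        )

    def forward(self, H, coords):
        # H: (N, D) instance features; coords: (N, 2) normalized (x, y) in [0, 1]
        N, D = H.shape
        Hp, Wp = self.grid_hw
        # --- grid-based remapping: average features falling into each bin ---
        h = (coords[:, 1] * Hp).long().clamp(max=Hp - 1)
        w = (coords[:, 0] * Wp).long().clamp(max=Wp - 1)
        idx = h * Wp + w                                  # flat bin index per instance
        M = H.new_zeros(Hp * Wp, D)
        M.index_add_(0, idx, H)                           # sum features per bin
        counts = torch.bincount(idx, minlength=Hp * Wp).clamp(min=1).unsqueeze(1)
        M = (M / counts).T.reshape(1, D, Hp, Wp)          # (1, D, H', W')
        # --- convolutional guidance head ---
        P = self.head(M)                                  # (1, 1, H', W'), in (0, 1)
        # --- patch-level gating: gather each instance's coarse guidance value ---
        M_A = P.flatten()[idx]                            # (N,)
        return H * M_A.unsqueeze(1)                       # H_k = H ⊙ M_A
```

Because the Sigmoid output lies in (0, 1), each instance feature is attenuated in proportion to the learned importance of its coarse neighborhood.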
2. Grid-based Remapping
Instance features and coordinates are mapped to a coarse grid via field-of-view driven binning. For each instance i with normalized coordinates (x_i, y_i) ∈ [0, 1]^2, the grid cell assignment is determined as h_i = ⌊y_i · H′⌋, w_i = ⌊x_i · W′⌋, where H′ = ⌈H_WSI / f⌉ and W′ = ⌈W_WSI / f⌉, with f the selected FOV (in pixels at the base magnification).
The feature map is computed by averaging all high-magnification vectors falling into each bin: M[:, h, w] = (1 / |S_{h,w}|) Σ_{i ∈ S_{h,w}} H_i, where S_{h,w} collects all instances assigned to grid cell (h, w). In vectorized notation, this is a scatter-mean of H over the flat indices idx, followed by reshaping to M ∈ ℝ^{D×H′×W′}.
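The binning and per-cell averaging amount to a scatter-mean, sketched here in NumPy (function name and the convention that empty bins stay zero are illustrative assumptions):

```python
import numpy as np


def grid_remap(H, coords, Hp, Wp):
    """Average (N, D) instance features into a (D, H', W') coarse map.

    Also returns each instance's flat bin index for later re-projection.
    coords holds normalized (x, y) positions in [0, 1].
    """
    N, D = H.shape
    h = np.minimum((coords[:, 1] * Hp).astype(int), Hp - 1)
    w = np.minimum((coords[:, 0] * Wp).astype(int), Wp - 1)
    idx = h * Wp + w                                   # flat bin index per instance
    sums = np.zeros((Hp * Wp, D))
    np.add.at(sums, idx, H)                            # scatter-add features per bin
    counts = np.maximum(np.bincount(idx, minlength=Hp * Wp), 1)[:, None]
    M = (sums / counts).T.reshape(D, Hp, Wp)           # empty bins remain zero
    return M, idx
```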
3. Convolutional Guidance Computation
After remapping, M is passed through three sequential convolutions: P = σ(Conv_{1×1}(ReLU(Conv_{3×3}(ReLU(Conv_{3×3}(M)))))), with P ∈ ℝ^{1×H′×W′}.
Here, D′ is used as the hidden channel width for all CGN blocks. No self-attention or Transformer module is included; the head is purely convolutional.
P is flattened to length H′·W′, and each instance gathers its coarse guidance value M_A[i] = P_flat[idx_i] according to its assigned index. The final gated features are H_k = H ⊙ M_A.
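The conv head and the index-based re-projection might look as follows in PyTorch (the dimensions D, D′, and the grid size are placeholders chosen for illustration):

```python
import torch
import torch.nn as nn

D, Dp, Hp, Wp = 16, 8, 4, 4            # feature dim D, hidden width D', grid H'xW'

# Purely convolutional guidance head: Conv3x3 -> ReLU -> Conv3x3 -> ReLU ->
# Conv1x1 -> Sigmoid, producing a single-channel coarse guidance map.
head = nn.Sequential(
    nn.Conv2d(D, Dp, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(Dp, Dp, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(Dp, 1, kernel_size=1), nn.Sigmoid(),
)

M = torch.randn(1, D, Hp, Wp)          # coarse feature map from remapping
P = head(M)                            # (1, 1, H', W'), values in (0, 1)

N = 10
idx = torch.randint(0, Hp * Wp, (N,))  # each instance's assigned flat bin index
H = torch.randn(N, D)
M_A = P.flatten()[idx]                 # gather coarse guidance per instance
H_k = H * M_A.unsqueeze(1)             # H_k = H ⊙ M_A
```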
4. Integration with Attention-based MIL
In standard attention-based MIL settings, instance embeddings H propagate through an attention aggregator to yield slide-level predictions: z = Σ_{i=1}^{N} a_i H_i, ŷ = g(z), where a_i are learned attention weights. Installing a CGN at scale f updates H as H ← H ⊙ M_A^{(f)}.
Stacking multiple CGNs (for example, at FOVs f ∈ {1536, 2048, 3072}) results in a progressive series of updates: H_k = H_{k−1} ⊙ M_A^{(k)}, k = 1, …, K, with H_0 = H. The final H_K is then input to attention modules such as ABMIL, DSMIL, CLAM-SB, or CLAM-MB, which conduct the slide-level aggregation.
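The progressive stacking ahead of an ABMIL-style aggregator can be sketched as follows. Here the CGN blocks are mocked as fixed per-instance gates in (0, 1) so the snippet is self-contained; a real pipeline would compute them from coords via the conv head:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D, N = 16, 32
H = torch.randn(N, D)                  # instance embeddings H_0

# Stand-ins for K = 3 CGN blocks at increasing FOVs, each yielding
# per-instance gates M_A^{(k)} in (0, 1).
cgn_gates = [torch.sigmoid(torch.randn(N)) for _ in range(3)]

# Progressive updates: H_k = H_{k-1} ⊙ M_A^{(k)}
for M_A in cgn_gates:
    H = H * M_A.unsqueeze(1)

# ABMIL-style attention over the modulated instances
attn = nn.Sequential(nn.Linear(D, 8), nn.Tanh(), nn.Linear(8, 1))
a = torch.softmax(attn(H), dim=0)      # (N, 1) attention weights over instances
z = (a * H).sum(dim=0)                 # slide-level embedding
logit = nn.Linear(D, 1)(z)             # slide-level prediction head
```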
5. Training Protocol and Hyperparameters
Training details for CGN-based models are as follows:
- Losses: Biomarker tasks (ER, PR, HER2 status) use cross-entropy loss. Prognosis tasks (CRC Surv) use a discrete-time negative log-likelihood survival loss (NLLSurvLoss) that combines censored and uncensored terms: L = −c · log S(t) − (1 − c) · [log S(t−1) + log h(t)], where h is the discrete hazard, S the survival function, and c the censoring indicator.
- Optimizer: AdamW with a cosine-decay learning-rate schedule.
- Early stopping: patience = 10.
- Epochs: maximum 150.
- FOV choices: At 20×, e.g., f ∈ {1536, 2048, 3072} pixels (providing 3 CGNs).
- Hidden channels: D′ per CGN.
- Parameter and compute cost: Each CGN adds approximately 0.6 M parameters per scale.
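The survival objective follows the standard discrete-time NLL form widely used in computational pathology (the Zadeh–Schmid formulation); the function name, epsilon clamping, and mean reduction below are assumptions, not details from the paper:

```python
import torch


def nll_surv_loss(hazards, t, c, eps=1e-7):
    """Discrete-time negative log-likelihood survival loss.

    hazards: (B, T) per-interval hazard probabilities in (0, 1)
    t:       (B,)  index of the event/censoring interval (long)
    c:       (B,)  censoring indicator (1 = censored, 0 = event observed)
    """
    S = torch.cumprod(1 - hazards, dim=1)                  # S(t) = prod_{j<=t} (1 - h(j))
    S_pad = torch.cat([torch.ones_like(S[:, :1]), S], 1)   # prepend S(-1) := 1
    t = t.unsqueeze(1)
    # censored term:   -log S(t)
    censored = -torch.log(S_pad.gather(1, t + 1).clamp(min=eps))
    # uncensored term: -log S(t-1) - log h(t)
    uncensored = -(torch.log(S_pad.gather(1, t).clamp(min=eps))
                   + torch.log(hazards.gather(1, t).clamp(min=eps)))
    return (c * censored.squeeze(1) + (1 - c) * uncensored.squeeze(1)).mean()
```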
6. Empirical Performance and Ablation Results
Empirical studies isolating the CGN demonstrate a consistent benefit on multiple biomarker classification tasks. For instance, a single CGN (FOV=1536) added to ABMIL (using CONCH features) produces:
| System | ER AUC (%) | PR AUC (%) | HER2 AUC (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| ABMIL w/o CGN (single-scale 20×) | 87.22 | 84.14 | 80.06 | — | — |
| ABMIL + single CGN (FOV=1536) | 88.92 | 84.76 | 80.84 | ~1.51 | ~17.7 |
| ABMIL + three CGNs ([1536,2048,3072]) | 89.76 | 85.24 | 82.86 | ~2.18 | ~17.7 |
| ABMIL + five CGNs ([1024,1536,2048,2560,3072]) | 91.42 | 84.18 | 84.62 | — | — |
Adding at least one CGN leads to a clear increase in slide-level AUC—e.g., gains of +1.70pp (ER), +0.62pp (PR), and +0.78pp (HER2) for a single scale. Stacking multiple CGNs for progressive multi-scale guidance further improves performance (e.g., +4.20pp for ER, +4.56pp for HER2). CGNs achieve these gains at reduced parameter and compute cost relative to methods such as concatenation or cross-scale attention schemes (2.18 M parameters / 17.7 G FLOPs for CGN vs. 2.61 M / 38.4 G for cross-scale alternatives), while delivering larger accuracy improvements (+3.6pp ER, +4.05pp HER2).
7. Summary of Properties
A CGN remaps high-magnification features to a spatially coarse grid, applies a three-layer convolutional head to compute a coarse attention map, reprojects this map back to the patch level to gate the D-dimensional features, and is trained end-to-end via the same MIL objectives. Each CGN block is lightweight (D′ hidden channels, approximately 0.6 M parameters per scale), incurs minimal additional computation, and has been shown to consistently improve slide-level prediction performance in clinical biomarker and prognosis tasks (Wu et al., 2 Feb 2026).