Cross-Phase Lesion-Aware Attention

Updated 20 January 2026

Cross-phase lesion-aware attention is a mechanism that fuses lesion features across imaging phases using dedicated attention modules.
It integrates Q/K/V lesion-guided embedding, learnable phase tokens, and multi-scale fusion to optimize segmentation and diagnostic performance.
The approach has been applied to CT, MRI, and retinal fundus imaging, with reported gains in AUC, Dice score, and average precision over baselines that lack explicit phase or lesion modeling.

Cross-phase lesion-aware attention refers to a set of network mechanisms designed to explicitly model and fuse information about lesions across multiple imaging phases or time points. These architectures marshal anatomical and pathological features, guided by learned attention weights, to optimize diagnostic or segmentation performance in contexts with temporally or parametrically diverse imaging (such as multi-phase CT, multi-sequence MRI, or longitudinal MRI studies). The mechanism is distinguished by its lesion-centric features and its direct modeling of dependencies between imaging phases, as exemplified by 3D inter-phase attention modules, dual-path decoupling, spatial-channel attention gates, and meta-learning-based adaptation strategies.

1. Network Design Principles and Architectural Variants

Cross-phase lesion-aware attention models are motivated by the challenge of capturing phase-dependent enhancement patterns, lesion activities, or co-occurring anatomical/pathological features in imaging studies with multi-phase or multi-sequence acquisition. Architectures generally combine the following components:

Multiple encoder branches, each processing one phase or timepoint (e.g., non-contrast, arterial, portal, delayed CT phases; baseline and follow-up MRI; multiple MR sequences) (Uhm et al., 2024, Gessert et al., 2020, Himmetoglu et al., 24 Mar 2025).
Lesion segmentation subnets (e.g., 3D U-Net) to localize regions of interest for pooling and fusion (Uhm et al., 2024).
Attention modules inserted either before phase fusion (encoders), at fusion (“cross-phase”), or after fusion (joint segmentation/classification heads).

The most prominent instantiation is LACPANet (Uhm et al., 2024), which operates over $N$ 3D CT volumes $\{I_i\}_{i=1}^N$, one per phase, producing a lesion mask per phase via a 3D U-Net and extracting deep features for attention-based fusion. A second recurring design is the decoupling of healthy-tissue and lesion features, as performed in double-stream MRI segmentation models (Himmetoglu et al., 24 Mar 2025).

2. Core Mechanisms: Lesion-Guided Feature Embedding and Attention Calculation

Lesion-aware cross-phase attention integrates anatomical priors (lesion masks) and phase context via the following steps:

Q/K/V Lesion-Guided Embedding: Each volume $I_i$ is processed by parallel branches generating feature maps $F_i^q, F_i^k, F_i^v$ (each of size $H \times W \times D \times C$), which are pooled over the predicted lesion mask $\hat S_i$ by masked average pooling (MAP):

$Q_i = \mathrm{MAP}(F_i^q, \hat S_i), \quad K_i = \mathrm{MAP}(F_i^k, \hat S_i), \quad V_i = \mathrm{MAP}(F_i^v, \hat S_i)$

where the MAP operation averages features over the voxels of the predicted lesion region, so that $Q_i$, $K_i$, and $V_i$ are each a single $C$-dimensional vector (Uhm et al., 2024).

Phase Embedding: Learnable tokens $P_i$ are added to $Q_i$ , $K_i$ , $V_i$ to encode phase identity in the attention fusion.
Attention Matrix Fusion: Classic Transformer-style scaled dot-product attention is applied to the matrices $Q, K, V \in \mathbb{R}^{N \times C}$ obtained by stacking the per-phase vectors:

$A = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{C}}\right)$

yielding an $N \times N$ matrix quantifying inter-phase influence. Values are fused:

$F_{\mathrm{out}} = \lambda A V + V$

with a residual coefficient $\lambda$ (Uhm et al., 2024).
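
The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the LACPANet implementation: the function names are hypothetical, the same phase token is added to $Q_i$, $K_i$, and $V_i$ (the source does not say whether the tokens are shared), and random arrays stand in for learned features and tokens.

```python
import numpy as np

def masked_average_pool(feat, mask, eps=1e-8):
    """Average a (H, W, D, C) feature map over the voxels where mask (H, W, D) is 1."""
    m = mask.astype(feat.dtype)[..., None]                    # (H, W, D, 1)
    return (feat * m).sum(axis=(0, 1, 2)) / (m.sum() + eps)   # (C,)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_phase_attention(Fq, Fk, Fv, masks, phase_tokens, lam=0.1):
    """Fq, Fk, Fv: (N, H, W, D, C); masks: (N, H, W, D); phase_tokens: (N, C).

    Assumption: one shared token per phase is added to Q, K and V.
    """
    Q = np.stack([masked_average_pool(f, s) for f, s in zip(Fq, masks)]) + phase_tokens
    K = np.stack([masked_average_pool(f, s) for f, s in zip(Fk, masks)]) + phase_tokens
    V = np.stack([masked_average_pool(f, s) for f, s in zip(Fv, masks)]) + phase_tokens
    C = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(C), axis=-1)   # (N, N) inter-phase attention
    F_out = lam * (A @ V) + V                    # residual fusion
    return F_out, A

# Toy usage: 4 phases, random features, a box-shaped "lesion" in every phase.
rng = np.random.default_rng(0)
N, H, W, D, C = 4, 8, 8, 4, 16
Fq, Fk, Fv = (rng.normal(size=(N, H, W, D, C)) for _ in range(3))
masks = np.zeros((N, H, W, D))
masks[:, 2:5, 2:5, 1:3] = 1
P = rng.normal(size=(N, C))                      # stand-in for learnable phase tokens
F_out, A = cross_phase_attention(Fq, Fk, Fv, masks, P)
print(F_out.shape, A.shape)                      # (4, 16) (4, 4)
```

Each row of `A` sums to one, so row $i$ can be read as the relative weight that phase $i$ assigns to every phase, itself included.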

Variants exist for different clinical and technical contexts. For MS lesion activity segmentation, attention masks $a_{BL}^i$ and $a_{FU}^i$ are computed by shallow convolutional layers on the concatenated feature maps from baseline and follow-up volumes; these masks suppress irrelevant (old) lesion features and preserve new/enlarged lesion regions (Gessert et al., 2020).
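
A rough sketch of this kind of attention gating is given below. It rests on several assumptions not stated in the source: a single 1x1x1 convolution (a per-voxel linear map) stands in for the shallow convolutional layers, a sigmoid keeps the masks in (0, 1), and the masks are applied multiplicatively. All names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_masks(f_bl, f_fu, w_bl, w_fu):
    """f_bl, f_fu: (H, W, D, C) baseline / follow-up feature maps.

    w_bl, w_fu: (2C,) weights of a 1x1x1 convolution, an assumed stand-in
    for the shallow convolutional layers mentioned in the text.
    """
    cat = np.concatenate([f_bl, f_fu], axis=-1)   # (H, W, D, 2C)
    a_bl = sigmoid(cat @ w_bl)[..., None]         # (H, W, D, 1)
    a_fu = sigmoid(cat @ w_fu)[..., None]         # (H, W, D, 1)
    return a_bl, a_fu

def gate_features(f_bl, f_fu, w_bl, w_fu):
    """Scale each path's features by its mask (assumed multiplicative gating)."""
    a_bl, a_fu = attention_masks(f_bl, f_fu, w_bl, w_fu)
    return f_bl * a_bl, f_fu * a_fu

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
H, W, D, C = 6, 6, 4, 8
f_bl = rng.normal(size=(H, W, D, C))
f_fu = rng.normal(size=(H, W, D, C))
w_bl = rng.normal(size=2 * C)
w_fu = rng.normal(size=2 * C)
g_bl, g_fu = gate_features(f_bl, f_fu, w_bl, w_fu)
print(g_bl.shape, g_fu.shape)                     # (6, 6, 4, 8) (6, 6, 4, 8)
```

Because both masks are computed from the concatenated features, each path can be gated using information from the other time point, which is what allows old lesions visible in both scans to be suppressed.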

3. Multi-scale and Resolution-aware Extensions

Improved performance is achieved by extending attention fusion to multi-scale feature sets:

Low-level branch ( $F_i^{\cdot,low}$ ) covers high-resolution textural cues.
High-level branch ( $F_i^{\cdot,high}$ ) captures coarse, global enhancement patterns.

Both branches are equipped with scale-specific phase embeddings and apply MAP pooling with a lesion mask downsampled to the branch resolution. Final predictions are combined via a weighted sum,

$\hat y^{final} = \alpha \hat y^{low} + (1 - \alpha) \hat y^{high}$

with the best results reported for a fusion weight of $\alpha \simeq 0.7$ (Uhm et al., 2024).
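
As a small worked example of the weighted sum, the snippet below fuses two class-probability vectors with $\alpha = 0.7$. The probability values are made up for illustration, and `fuse_predictions` is a hypothetical helper name.

```python
import numpy as np

def fuse_predictions(y_low, y_high, alpha=0.7):
    """Weighted sum of low-level and high-level branch predictions."""
    y_low = np.asarray(y_low, dtype=float)
    y_high = np.asarray(y_high, dtype=float)
    return alpha * y_low + (1.0 - alpha) * y_high

# Illustrative (made-up) class probabilities from the two branches.
y_low = [0.6, 0.3, 0.1]
y_high = [0.2, 0.5, 0.3]
y_final = fuse_predictions(y_low, y_high)
print(y_final)   # approximately [0.48, 0.36, 0.16]
```

If both inputs are valid probability vectors, the fused output also sums to one, since the two weights sum to one.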

A similar resolution-aware fusion is found in diabetic retinopathy screening, where the Feature-Preserve Module (FPM) fuses local, bottleneck, and previous-stage lesion features, weighted by an attention map upsampled to match the current resolution (Xia et al., 2024).

4. Training and Adaptation Strategies

Supervised training leverages pixel/voxel-level ground truth when available, e.g., a Dice loss for the segmentation subnet, which reaches kidney Dice ≈ 0.97 and tumor Dice ≈ 0.86 (Uhm et al., 2024). Loss terms are balanced by weighting parameters such as $\beta$ for multi-scale classification, while the attention fusion is controlled separately by the residual coefficient $\lambda$.
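
For reference, a common soft Dice loss can be written as below. This is the generic formulation, given as a sketch; the cited works may differ in smoothing term or per-class handling, and the `eps` value is an assumption.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice overlap between predicted probabilities and a binary mask.

    pred and target may have any (matching) shape; both are flattened.
    eps is a small smoothing constant (assumed value) that avoids 0/0.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = (pred * target).sum()
    denominator = pred.sum() + target.sum()
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

# A perfect prediction gives a loss of 0; a disjoint one gives a loss near 1.
mask = np.array([1, 1, 0, 0])
loss_perfect = soft_dice_loss(mask, mask)
loss_disjoint = soft_dice_loss(1 - mask, mask)
```

A Dice score of 0.86, as quoted for the tumor class, corresponds to a loss of 0.14 under this definition.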

Training with disparately labeled datasets is enabled by two-path architectures and meta-learning strategies. For instance, anatomy and lesion streams are trained separately from anatomy-only and lesion-only sets, then fused with a small attention net. A test-time meta-learning adaptation procedure realigns the anatomy feature extractor to ignore lesion-disrupted voxels by minimizing a consistency loss on randomized-lesion images (Himmetoglu et al., 24 Mar 2025). This procedure employs a bi-level MAML objective to optimize for robust multi-path segmentation in the presence of lesions.
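
The consistency idea can be illustrated with a deliberately simplified toy: a linear "feature extractor" on a 1D signal, one synthetic lesion, and a single gradient step. This is only a sketch of the inner consistency step under assumed names and a made-up lesion model. It does not reproduce the bi-level MAML objective or the architecture of the cited work.

```python
import numpy as np

def insert_random_lesion(x, rng, size=3, intensity=2.0):
    """Copy x and overwrite a random contiguous segment (hypothetical lesion model)."""
    x_aug = x.copy()
    start = rng.integers(0, x.size - size + 1)
    x_aug[start:start + size] = intensity
    return x_aug

def consistency_loss(W, x, x_aug):
    """Squared distance between features of the original and lesioned signal."""
    diff = W @ x - W @ x_aug
    return float(diff @ diff)

def adapt_step(W, x, x_aug, lr=0.01):
    """One gradient-descent step on the consistency loss with respect to W."""
    d = x - x_aug                        # nonzero only at lesioned positions
    grad = 2.0 * np.outer(W @ d, d)      # gradient of ||W d||^2 w.r.t. W
    return W - lr * grad

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=32)       # toy "healthy" signal
W = rng.normal(size=(8, 32))             # toy linear feature extractor
x_aug = insert_random_lesion(x, rng)
before = consistency_loss(W, x, x_aug)
W_new = adapt_step(W, x, x_aug)
after = consistency_loss(W_new, x, x_aug)
print(after < before)                    # True
```

In this toy the gradient is nonzero only in the columns of `W` that correspond to lesioned positions, so the update reduces the extractor's sensitivity to exactly those voxels and leaves the rest untouched. A practical method needs additional terms or constraints, since the consistency loss alone is trivially minimized by an extractor that outputs a constant.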

5. Empirical Evaluation and Ablation Findings

Across datasets, the cited studies report improvements for cross-phase lesion-aware attention over baselines lacking explicit phase modeling.

Key Results:

Model	Setting	AUC	Dice/F1	Lesion Dice (mean)
LACPANet	Renal CT	0.9426	0.7979	-
LANet	DR Segmentation	-	0.475	-
Attention two-path	MS Lesion Activity	-	65.6	-
Joint segmentation (Himmetoglu et al., 24 Mar 2025)	Brain MRI	-	0.881 (tumor)	-

Ablations reveal:

Inclusion of phase tokens enhances attention matrix discrimination (+0.6% AUC on CT).
Multi-scale fusion with optimal $\alpha$ outperforms single-scale attention (+2.4% AUC, +13.4% F1).
The residual attention coefficient $\lambda$ is optimal near 0.1.
In DR, LANet with LAM+FPM achieves mean AP gains of +7.6%, +2.1%, +1.2% on three datasets. LAM and FPM act synergistically (Xia et al., 2024).
In MS, suppressive attention masking at old lesions and enhancement at new ones is visualized directly; Dice scores and true positive rates match or exceed interrater reliability (Gessert et al., 2020).

6. Mechanistic Insights and Diagnostic Implications

Explicitly modeling phase/sequence relationships with a lesion focus has several consequences:

Lesion-aware fusion avoids mixing lesion and normal-tissue signals, reducing the label noise and misclassification that affect naïve multi-phase or multi-sequence models.
Attention matrices dynamically re-weight phase contributions, automatically resolving ambiguous patterns (e.g., arterial phase dominance in differentiating oncocytoma vs. ccRCC, reinforcement between neighboring phases for subclass separation) (Uhm et al., 2024).
In meta-adapted segmentation, test-time consistency criteria force the anatomy stream to ignore lesion-corrupted voxels, enabling training from unpaired datasets (Himmetoglu et al., 24 Mar 2025).

7. Applications Across Modalities and Pathologies

Cross-phase lesion-aware attention has been applied in the following principal domains:

Renal tumor subtype classification on multi-phase CT, leveraging enhancement kinetics (Uhm et al., 2024).
Diabetic retinopathy lesion segmentation and screening from fundus images, utilizing orientation- and global-channel attention (Xia et al., 2024).
Brain tumor (glioblastoma) multi-sequence MRI joint segmentation, enabling learning from anatomy-only and lesion-only data via attention fusion and meta-learning (Himmetoglu et al., 24 Mar 2025).
Longitudinal multiple sclerosis lesion activity detection, leveraging dual encoder paths, suppression of old-lesion regions via learned attention masking, and fusion strategies (Gessert et al., 2020).

The mechanism is most relevant where phase/sequence information or temporal patterns are diagnostic, and where data for healthy anatomy and lesions are disparately labeled. The cited studies report accuracy gains over baselines without explicit phase or lesion modeling, and in the MS setting performance that matches or exceeds interrater reliability (Gessert et al., 2020).