
Anatomical Similarity Curriculum Learning

Updated 29 December 2025
  • The paper introduces ASCL, a progressive training method that leverages anatomical hierarchy to structure learning for both 3D segmentation and multimodal reasoning.
  • It employs a coarse-to-fine paradigm where dominant structures are learned first, mitigating class imbalance and semantic confusion in complex datasets.
  • Empirical results demonstrate significant improvements in Dice scores and reasoning accuracy, validating ASCL’s effectiveness in medical image segmentation and VQA tasks.

Anatomical Similarity Curriculum Learning (ASCL) is a principled progressive training paradigm for medical image segmentation and multimodal foundation models, in which inter-sample semantic similarity and anatomical hierarchy govern example ordering, supervision, and task structure. These curricula address severe class imbalance, high intra- and inter-class semantic confusability, and domain-specific reasoning challenges that prevent conventional supervised or reinforcement-based optimization from converging efficiently or stably. ASCL has recently been instantiated both in 3D vascular segmentation with explicit tree-structured hierarchies (Shi, 18 Nov 2025) and in multimodal LLMs (MLLMs) for anatomical reasoning via similarity-based curriculum design (Song et al., 22 Dec 2025).

1. Foundational Concepts and Motivations

ASCL leverages the semantic or anatomical similarity between classes, structures, or answers to design an ordering or hierarchical partition of training tasks. In imaging, the curriculum is constructed “coarse-to-fine” according to the anatomical tree, ensuring early learning of high-prevalence, low-ambiguity structures before introducing rare or confusable branches (Shi, 18 Nov 2025). In multimodal LLMs, the curriculum is defined with respect to the difficulty of multiple-choice anatomy-related questions, with “difficulty” quantified by the maximum semantic similarity between the correct option and the most challenging distractor (Song et al., 22 Dec 2025). This ordering accelerates knowledge transfer from simple to complex subproblems, mitigates early model collapse, and enables more effective handling of inherent class or label imbalance.

ASCL is motivated by cognitive and didactic analogies, drawing explicitly on the human learning paradigm of “simple-to-complex” curriculum structures. In both 3D segmentation and anatomy VQA, the curriculum is precisely tied to domain-specific semantic, anatomical, or ontological structure, rather than imposed arbitrarily or via generic heuristics.

2. Curriculum Construction: Anatomical Hierarchy and Similarity Metrics

In multi-class vascular segmentation (Shi, 18 Nov 2025), the aortic tree is decomposed into an L-level directed tree T = (V, E), with root-to-leaf path length reflecting coarse-to-fine anatomical granularity. The curriculum initially supervises only the top of the tree (root and major trunks), delaying the introduction of minor branches (e.g., carotids, iliacs) until performance at the higher levels plateaus. Each stage presents the network with a less imbalanced variant of the full segmentation problem, concentrating the learning signal on dominant anatomical classes before tackling sparse, topologically critical subclasses.

For anatomy reasoning in MLLMs (Song et al., 22 Dec 2025), semantic similarity between answer choices is measured by embedding each candidate via a frozen medical text encoder (MedCLIP). Pairwise cosine similarity sim(v_c, v_i) provides a scalar measure of conceptual proximity. The curriculum assigns each question a difficulty score S(q) = max_{d_i ∈ D(q)} sim(E(o_c), E(d_i)), partitioning the training corpus into K bins ordered from easy (low S(q)) to hard (high S(q)).
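The difficulty score above can be sketched directly; `embed` here is a toy bag-of-characters stand-in for the frozen MedCLIP text encoder (a hypothetical substitution so the example is self-contained), but the scoring logic — the hardest distractor dominates — follows the definition of S(q):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a frozen text encoder (NOT MedCLIP): bag of letters.
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord('a')] += 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def difficulty(correct: str, distractors: list[str]) -> float:
    """S(q) = max_i sim(E(o_c), E(d_i)): the hardest distractor sets the score."""
    vc = embed(correct)
    return max(cosine(vc, embed(d)) for d in distractors)
```

A question whose distractors are near-synonyms of the correct option (e.g., the contralateral vessel) scores higher than one whose distractors are anatomically unrelated.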

| Domain | Curriculum Partitioning | Similarity Metric / Hierarchy |
|---|---|---|
| 3D segmentation | Tree levels (root → leaves) | Explicit anatomical hierarchy T = (V, E) |
| MLLM reasoning | Question bins (easy → hard) | Cosine similarity between candidates (MedCLIP) |

This schema enables staged progression, with more challenging instances introduced only after satisfactory performance on lower-difficulty bins or hierarchy levels.
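The partitioning step can be sketched as quantile binning over the difficulty scores (quantiles are one plausible choice named in the MLLM instantiation; the exact binning rule is otherwise an assumption):

```python
import numpy as np

def partition_into_bins(scores, K: int):
    """Split question indices into K curriculum bins, easy (low S) to hard (high S)."""
    scores = np.asarray(scores, dtype=float)
    # Quantile edges give roughly equal-sized bins regardless of score scale.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, K + 1))
    bins = [[] for _ in range(K)]
    for i, s in enumerate(scores):
        # Inner edges only, so every score maps to a bin index in [0, K-1].
        k = int(np.searchsorted(edges[1:-1], s, side='right'))
        bins[k].append(i)
    return bins
```

Training then visits `bins[0]` first and advances only after the performance criterion for the current bin is met.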

3. Mathematical Formalism and Objective Functions

3D segmentation (aorta) utilizes a fractal softmax construction to encode anatomical hierarchy directly into the output probabilities at each tree level. For each level ℓ = 1, …, L:

  • The network outputs logits for leaf classes, and
  • For coarser levels, child logits are aggregated by pointwise log-sum-exp, enforcing probability-mass consistency between parents and children:

z_v^{(ℓ)} = log Σ_{c ∈ child(v)} exp( z_c^{(ℓ+1)} )

  • A softmax over each level yields the level-wise probabilities p^{(ℓ)}:

p_v^{(ℓ)} = exp( z_v^{(ℓ)} ) / Σ_{u ∈ V_ℓ} exp( z_u^{(ℓ)} )

Supervision is imposed by a hierarchical semantic loss incorporating cross-entropy, Dice, and centerline-boundary Dice terms, with level weights λ_ℓ emphasizing coarse-to-fine progression.
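Assuming the log-sum-exp reading of the child-logit aggregation, the hierarchy consistency property can be checked numerically: softmax over the aggregated coarse logits assigns each parent exactly the probability mass of its children.

```python
import numpy as np

def logsumexp(x: np.ndarray) -> float:
    m = x.max()
    return float(m + np.log(np.exp(x - m).sum()))

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

# Leaf logits for a single voxel; leaves grouped under two coarse parents.
leaf_logits = np.array([1.0, 2.0, 0.5, -1.0])   # leaves: [a1, a2, b1, b2]
children = {"A": [0, 1], "B": [2, 3]}           # parent -> leaf indices (illustrative)

# Coarse logits: pointwise log-sum-exp over each parent's children.
coarse_logits = np.array([logsumexp(leaf_logits[idx]) for idx in children.values()])

p_leaf = softmax(leaf_logits)
p_coarse = softmax(coarse_logits)

# Hierarchy consistency: each parent's probability is the sum of its children's.
for k, idx in enumerate(children.values()):
    assert np.isclose(p_coarse[k], p_leaf[idx].sum())
```

This is what allows a single set of leaf logits to be supervised at any level of the tree during the curriculum.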

MLLM anatomical reasoning deploys Group Relative Policy Optimization (GRPO):

  • For each question and answer group, compute group-normalized advantage:

A_i = ( r_i - mean(r_1, …, r_G) ) / std(r_1, …, r_G)

  • Objective (for policy π_θ):

J(θ) = E[ (1/G) Σ_{i=1}^{G} min( ρ_i A_i, clip(ρ_i, 1-ε, 1+ε) A_i ) ] - β · KL( π_θ ‖ π_ref ),  where ρ_i = π_θ(o_i | q) / π_old(o_i | q)

Progression through curriculum bins is conditional on the running average reward in each bin reaching a pre-set threshold.
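A minimal numerical sketch of the group-normalized advantage and the clipped surrogate term (standard GRPO form; the KL penalty is omitted and the rewards are dummy binary values):

```python
import numpy as np

def group_advantages(rewards):
    """Group-normalized advantage: A_i = (r_i - mean(r)) / std(r)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(ratios, advantages, eps=0.2):
    """PPO-style clipped term inside the GRPO objective (KL penalty omitted)."""
    rho = np.asarray(ratios, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    return float(np.minimum(rho * adv, np.clip(rho, 1 - eps, 1 + eps) * adv).mean())

# Binary rewards for a group of G = 4 sampled answers to one question.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Normalizing within the group means correct answers receive positive advantage and incorrect ones negative advantage regardless of the question's absolute difficulty, which is what makes curriculum binning by difficulty meaningful.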

4. Training Protocols and Implementation

In 3D segmentation (Shi, 18 Nov 2025), training proceeds in two stages:

  • Stage 1: Binary nnU-Net on downsampled volume (SGD, Nesterov, initial lr = 0.01, batch = 2, 500 epochs)
  • Stage 2: Multi-class nnU-Net with fractal softmax, curriculum schedule:
    • Epochs 0–100: Levels 1–2 supervised
    • 100–250: Add Level 3
    • 250–400: Add Level 4
    • Epochs >400: Full hierarchy with all levels supervised
  • By epoch 50, coarse structures are robustly learned, allowing fine branches to be addressed without catastrophic forgetting or class imbalance issues.
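The staged schedule above reduces to a simple epoch-to-levels lookup; the depth of the full hierarchy is an assumption here, since only levels 1–4 are named explicitly:

```python
def supervised_levels(epoch: int) -> list[int]:
    """Which hierarchy levels receive supervision at a given Stage 2 epoch."""
    if epoch < 100:
        return [1, 2]            # coarse: root and major trunks
    if epoch < 250:
        return [1, 2, 3]
    if epoch < 400:
        return [1, 2, 3, 4]
    return [1, 2, 3, 4, 5]       # full hierarchy (assumed depth L = 5)
```

At each epoch the hierarchical loss is evaluated only over the returned levels, so early training sees the least imbalanced version of the problem.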

In MLLM reasoning (Song et al., 22 Dec 2025), ASC-Learning operates as follows:

  • Compute the difficulty score S(q) for each question.
  • Partition the corpus into K bins, e.g., by quantiles of S(q).
  • For each bin k, train using GRPO until the average reward in the bin surpasses a preset threshold τ, then proceed to the next bin.
  • Key hyperparameters (bin count K, reward threshold τ, group size G, batch size, number of RL updates per stage) are all validated empirically.
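The bin-gated progression above can be sketched as a plain loop; `train_step` is a hypothetical stand-in for one GRPO update returning a scalar reward, so only the gating logic is shown:

```python
def train_with_curriculum(bins, train_step, threshold, max_steps=1000, window=20):
    """Advance to the next bin once the running average reward passes `threshold`."""
    history = []
    for k, bin_k in enumerate(bins):
        rewards = []
        for _ in range(max_steps):
            rewards.append(train_step(bin_k))
            # Running average over the most recent updates in this bin.
            recent = rewards[-window:]
            if sum(recent) / len(recent) >= threshold:
                break
        history.append((k, len(rewards)))   # steps spent in each bin
    return history
```

The `max_steps` cap is a practical guard (an assumption, not from the source) so that training cannot stall indefinitely in a bin whose threshold is never reached.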

Group Diversity Question Augmentation (GDQA) (Song et al., 22 Dec 2025) is deployed to prevent vanishing gradients: each original question is expanded into multiple text/image-augmented variants, ensuring group-level variance in reward and a continued learning signal during RL training.

5. Inference Strategies and Acceleration

For segmentation (Shi, 18 Nov 2025), a two-stage inference pipeline is employed:

  1. Lightweight binary segmentation to localize and crop the region of interest (ROI).
  2. Full-resolution, multi-class segmentation with fractal softmax focused on the ROI.

This yields up to 5× acceleration in inference time (22s on an ROI cropped with a 1× margin vs. 115s for the whole volume), with accuracy maintained.
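The ROI cropping in step 1 can be illustrated with a bounding-box crop of the stage-1 binary mask (the margin size and array shapes here are illustrative, not the paper's settings):

```python
import numpy as np

def roi_bounding_box(mask: np.ndarray, margin: int = 8):
    """Bounding box of a binary mask, expanded by `margin` voxels per side."""
    coords = np.argwhere(mask)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple(slice(a, b) for a, b in zip(lo, hi))

# Stage 1 produces a coarse binary mask; stage 2 then runs only on the crop.
volume = np.zeros((64, 64, 64))
mask = np.zeros((64, 64, 64), dtype=bool)
mask[20:30, 25:35, 30:40] = True
roi = roi_bounding_box(mask, margin=4)
cropped = volume[roi]
```

Restricting the expensive multi-class pass to the cropped region is what produces the reported speedup, since the aorta occupies a small fraction of the scanned volume.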

MLLM-based reasoning strategies do not involve staged inference but rely on post-curriculum policies with improved stability and accuracy, as evidenced by gradual reward improvement across curriculum bins.

6. Quantitative Results

In aortic segmentation (Shi, 18 Nov 2025):

  • Validation (epoch 50): ASCL w/ hierarchy, Dice: 70.15% (+11.65% over baseline), NSD: 65.87% (+13.56%)
  • Final 40-scan test: Dice 77.9% vs. CIS-UNet 72.3% (Δ=+5.6%)
  • Inference: the two-stage pipeline achieves 22s per volume vs. 115s for the single-stage baseline.

In anatomy VQA (Song et al., 22 Dec 2025) (SGG-VQA Benchmark):

  • Qwen-2.5-VL-3B: Zero-shot Pass@1: 25.9% → SFT: 35.9% → GRPO-ASC: 42.7% → GDQA: 42.8%
  • Avg@5: 26.7% (ZS) → 34.1% (SFT) → 40.2% (ASC) → 43.4% (GDQA)
  • Qwen-2.5-VL-7B: Pass@1: up to 47.1% (GDQA) vs. 29.6% (ZS) and 36.1% (SFT).

Ablation studies in both domains confirm that curriculum progression—especially with similarity- or anatomy-informed ordering—outperforms flat or non-curriculum training, both in convergence speed and final performance.

7. Implementation and Reproducibility

Segmentation code is available via https://github.com/PengchengShi1220/AortaSeg24; the fractal softmax module is forthcoming (Shi, 18 Nov 2025). Experiments use PyTorch 2.2.2, CUDA 12.1, and nnU-Net V2, with GPU deployments on NVIDIA RTX 3090 (24 GB). MLLM code and data are hosted at https://github.com/tomato996/Anatomy-R1 (Song et al., 22 Dec 2025); implementations use the MS-SWIFT RL framework on 4 × NVIDIA A800 GPUs.

Loss hyperparameters weight the cross-entropy, Dice, and centerline-boundary Dice terms; hierarchy level weights λ_ℓ ramp linearly from root to leaf. MLLM curriculum parameters (e.g., bin counts, reward thresholds) are tuned on held-out sets, with group size G and RL optimization settings chosen for practical convergence.

8. Significance and Prospects

ASCL unifies curriculum learning with domain-specific semantic or structural priors, delivering marked improvements in class imbalance mitigation, convergence acceleration, and topological faithfulness in medical segmentation, as well as more robust, non-collapsed optimization in multimodal anatomical reasoning tasks. Empirical gains include up to 11.7% absolute Dice score improvement (intermediate epochs), 5.6% on final aorta test sets, and 10–17% accuracy improvement in MLLM Pass@1 for surgical anatomy VQA.

These results suggest that anatomical or semantic similarity-driven curricula can be generalized to a range of medical AI domains where label sparsity, structure, or high-order class confusability pose unique challenges. Future work may refine curriculum granularity, extend hierarchical modeling, or integrate dynamic curriculum adjustments based on real-time learning signal.

Principal References:

  • "Hierarchical Semantic Learning for Multi-Class Aorta Segmentation" (Shi, 18 Nov 2025)
  • "Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal LLMs via Anatomical Similarity Curriculum and Group Diversity Augmentation" (Song et al., 22 Dec 2025)
