MAVIC RL: Morphology-Consistent Medical AI
- The paper introduces MAVIC RL, a novel reinforcement learning framework that decomposes the reward into diagnosis accuracy, taxonomy alignment, morphology agreement, and format validity.
- The methodology employs a group relative policy-gradient loss with a gated morphology reward to ensure model predictions align with explicit morphological features and audit trails.
- Empirical evaluations on DermoBench show that MAVIC RL improves diagnostic consistency and reasoning quality while mitigating shortcutting and hallucinations.
Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning (RL) is a domain-adapted RL objective developed for multimodal LLMs (MLLMs) in dermatological reasoning, introduced in the context of the DermoGPT system. MAVIC fundamentally reframes the RL landscape in medical AI from reward structures focused solely on end-task accuracy to those enforcing verifiable consistency across the full diagnostic workflow: morphological observation, structured description, clinical reasoning, and diagnosis. This is accomplished through a multidimensional, morphology-anchored reward mechanism that penalizes shortcutting and incentivizes interpretable, auditable diagnostic chains grounded in explicit intermediate representations (Ru et al., 5 Jan 2026).
1. Motivation and Distinction from Standard RL
Conventional MLLM RL objectives in the medical domain often prioritize the final diagnostic label, neglecting the intermediate steps typical of specialist workflows. This can lead to models exploiting spurious correlations, omitting morphological assessments, and generating hallucinated rationales. Clinical experts, conversely, rely on a serial pipeline: visual assessment, structured morphological description, reasoned interpretation, and ultimately diagnostic conclusion. MAVIC addresses this by decomposing the reward into accuracy (diagnosis), taxonomy-level alignment (ontology closeness), morphology agreement (feature-level JSON structure), and format validity (output well-formedness). By anchoring the reward to a "concept bottleneck" such as Derm7pt or SkinCon JSON, MAVIC enforces a morphology-first chain of inference, mitigating ungrounded predictions and enabling explicit auditability.
2. Formal RL Components and Objective
2.1 State, Actions, and Rollout
- State ($s_t$): composed of visual features $v$ extracted by a frozen vision tower, the context/instruction $c$, and the previously generated output tokens $y_{<t}$.
- Action ($a_t$): emission of the next token $y_t$, which may belong to the structured morphology JSON, intermediate natural-language reasoning, or the final diagnosis.
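As a minimal sketch of this state/action formulation (names and types are illustrative, not the paper's implementation), the rollout state can be modeled as an immutable record that grows by one token per action:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RolloutState:
    """State s_t: frozen visual features, instruction context, generated tokens."""
    visual_features: tuple      # stand-in for the frozen vision-tower output
    context: str                # instruction / prompt
    tokens: tuple = field(default_factory=tuple)  # previously emitted tokens y_<t

    def step(self, token: str) -> "RolloutState":
        """Action a_t = emit `token`; returns the successor state s_{t+1}."""
        return RolloutState(self.visual_features, self.context,
                            self.tokens + (token,))

# A rollout alternates morphology-JSON tokens, reasoning, and diagnosis tokens.
s = RolloutState(visual_features=(0.1, 0.7), context="Describe the lesion.")
s = s.step("<morph>").step('{"asymmetry":')
```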
2.2 Reward Decomposition
For each rollout $\tau$, MAVIC computes four reward terms:
- $R_{\text{acc}}$: final-diagnosis accuracy.
- $R_{\text{tax}}$: Wu–Palmer taxonomy-based similarity between predicted and ground-truth diagnoses in the disease ontology,
$$R_{\text{tax}} = \frac{2\,\operatorname{depth}(\operatorname{LCS}(d_{\text{pred}}, d_{\text{gt}}))}{\operatorname{depth}(d_{\text{pred}}) + \operatorname{depth}(d_{\text{gt}})}.$$
- $R_{\text{morph}}$: PMI-weighted Tversky index over predicted vs. ground-truth morphological JSON. Given predicted feature set $P$ and ground-truth feature set $T$, weighted featurewise by PMI weights $w_f$,
$$R_{\text{morph}} = \frac{\sum_{f \in P \cap T} w_f}{\sum_{f \in P \cap T} w_f + \alpha \sum_{f \in P \setminus T} w_f + \beta \sum_{f \in T \setminus P} w_f}.$$
- $R_{\text{fmt}}$: format validity (presence and well-formedness of tags such as `morph` and `final_diagnosis`).
Gating function: to prevent degenerate solutions (e.g., outputting plausible morphologies for wrong diagnoses), $R_{\text{morph}}$ is "unlocked" only when $R_{\text{tax}}$ exceeds the batch median $\tilde{R}_{\text{tax}}$:
$$g = \sigma\!\big(\kappa\,(R_{\text{tax}} - \tilde{R}_{\text{tax}})\big),$$
where $\sigma$ denotes the sigmoid function and $\kappa$ a sharpness constant.
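The three non-trivial components above can be sketched in a few lines of plain Python. This is a hedged illustration, not the paper's code: `kappa`, the Tversky `alpha`/`beta`, and the PMI default of 1.0 are assumed values, and the taxonomy depths are passed in rather than computed from a real ontology.

```python
import math

def wu_palmer(depth_lcs: int, depth_a: int, depth_b: int) -> float:
    """Wu-Palmer similarity: 2*depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth_lcs / (depth_a + depth_b)

def pmi_tversky(pred: set, truth: set, pmi: dict,
                alpha: float = 0.5, beta: float = 0.5) -> float:
    """PMI-weighted Tversky index over morphology feature sets.

    Features missing from the PMI table fall back to weight 1.0 (assumption).
    """
    w = lambda feats: sum(pmi.get(f, 1.0) for f in feats)
    common = w(pred & truth)
    return common / (common + alpha * w(pred - truth)
                     + beta * w(truth - pred) + 1e-9)

def gate(r_tax: float, batch_median: float, kappa: float = 10.0) -> float:
    """Sigmoid gate: morphology reward unlocks when R_tax beats the batch median."""
    return 1.0 / (1.0 + math.exp(-kappa * (r_tax - batch_median)))
```

Rare features with high PMI dominate the Tversky sums, so a model that omits a rare but diagnostically salient attribute pays a proportionally larger penalty than one that omits a common attribute.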
2.3 Aggregate MAVIC Reward
Total per-rollout reward:
$$R(\tau) = w_{\text{acc}} R_{\text{acc}} + w_{\text{tax}} R_{\text{tax}} + g \cdot w_{\text{morph}} R_{\text{morph}} + w_{\text{fmt}} R_{\text{fmt}}.$$
The weights $w_{\text{acc}}, w_{\text{tax}}, w_{\text{morph}}, w_{\text{fmt}}$ are fixed default hyperparameters.
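The aggregation is a gated weighted sum. In the sketch below the weight values are illustrative placeholders, not the paper's defaults:

```python
def mavic_reward(r_acc: float, r_tax: float, r_morph: float, r_fmt: float,
                 g: float,
                 w_acc: float = 1.0, w_tax: float = 0.5,
                 w_morph: float = 0.5, w_fmt: float = 0.2) -> float:
    """Total per-rollout MAVIC reward; the morphology term is scaled by gate g."""
    return w_acc * r_acc + w_tax * r_tax + g * w_morph * r_morph + w_fmt * r_fmt
```

With the gate closed ($g \approx 0$), a rollout earns nothing for its morphology JSON regardless of how plausible it looks, which is exactly the anti-shortcutting behavior described above.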
2.4 Group Relative Policy-Gradient (GRPO) Loss
Let $\pi_\theta$ be the policy, $B$ the batch size, and $G$ the number of trajectories sampled per data point. The baseline for each datum $i$ is the average reward over its rollouts:
$$b_i = \frac{1}{G} \sum_{g=1}^{G} R(\tau_i^g).$$
The loss minimized is:
$$\mathcal{L}_{\text{RL}} = -\frac{1}{BG} \sum_{i=1}^{B} \sum_{g=1}^{G} \frac{\pi_\theta(\tau_i^g)}{\pi_{\text{old}}(\tau_i^g)} \big(R(\tau_i^g) - b_i\big) + \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),$$
where $\beta$ is the KL-penalty coefficient and $\pi_{\text{ref}}$ is the policy at the start of RL.
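A minimal, dependency-free sketch of this loss for one datum's group of rollouts follows. It assumes each `logp_*` entry is the total log-probability of a trajectory under the corresponding policy, and uses a simple per-sample KL estimate; the `beta` value is an assumption:

```python
import math

def grpo_loss(rewards, logp_new, logp_old, logp_ref, beta=0.02):
    """Group-relative policy-gradient loss over one datum's G rollouts.

    rewards, logp_new, logp_old, logp_ref: length-G sequences.
    Baseline = mean group reward; advantage = reward - baseline.
    """
    G = len(rewards)
    baseline = sum(rewards) / G
    loss = 0.0
    for R, ln, lo, lr in zip(rewards, logp_new, logp_old, logp_ref):
        ratio = math.exp(ln - lo)   # importance ratio pi_theta / pi_old
        advantage = R - baseline    # group-relative advantage
        kl = ln - lr                # crude per-sample KL-to-reference estimate
        loss += -(ratio * advantage) + beta * kl
    return loss / G
```

Because the baseline is computed within each group, no learned value function is needed; rollouts better than their siblings get positive advantage and are reinforced.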
3. Fine-Tuning Algorithm and Hyperparameters
Pseudocode Summary
```
initialize θ ← SFT checkpoint
for epoch in 1..1:
    for minibatch {(image_i, instr_i, y_i)}:
        for i in 1..N:
            v_i = vision_tower(image_i)
            c_i = construct_prompt(instr_i)
            τ^1_i, ..., τ^G_i ~ π_θ(·|v_i, c_i)
            for each g:
                extract diagnosis, morph-JSON from τ^g_i
                compute component rewards
        compute baselines
        compute L_RL and update LoRA params (AdamW, lr=1e-6)
```
Key Hyperparameters
- Group size $G = 8$
- Generation: temperature=1.0, top-p=1.0, top-k=50
- LoRA (second stage): rank=16, dropout=0.05
- Learning rate: 1e-6, cosine schedule with 3% warmup
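The generation settings above combine temperature, top-k, and top-p (nucleus) filtering. A self-contained sketch of that sampling scheme, operating on a `{token: logit}` dict for clarity (real implementations work on tensors), is:

```python
import math
import random

def sample_token(logits: dict, temperature: float = 1.0,
                 top_k: int = 50, top_p: float = 1.0) -> str:
    """Sample one token with temperature, top-k, then top-p filtering."""
    # Top-k: keep only the k highest-scoring tokens.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax (shifted by the max logit for stability).
    m = max(v for _, v in items)
    probs = [(t, math.exp((v - m) / temperature)) for t, v in items]
    z = sum(p for _, p in probs)
    probs = [(t, p / z) for t, p in probs]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw.
    z = sum(p for _, p in kept)
    r, acc = random.random() * z, 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]
```

Note that with temperature=1.0 and top-p=1.0 (the listed defaults), only the top-k=50 truncation actually constrains sampling.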
4. Training Integration and Workflow
The complete DermoGPT-RL training sequence follows a two-stage paradigm:
- Supervised Fine-Tuning (SFT):
- Data: 646,000 DermoInstruct pairs, spanning free-text, JSON, CoT, flat diagnosis, and hierarchy formats.
- Loss: Cross-entropy.
- Train the vision and merger modules with the LLM frozen; LoRA rank=64.
- MAVIC RL:
- Initialize from SFT checkpoint.
- Freeze vision, merger, LLM; update only LoRA adapters (rank=16).
- Optimize MAVIC objective for one epoch.
The final model, DermoGPT-RL, can be further adapted at inference time using a Confidence-Consistency Test-time adaptation (CCT) mechanism for improved robustness.
5. Consistency Enforcement and Concept Bottlenecks
MAVIC enforces explicit consistency constraints across the inference pipeline:
- Concept Bottleneck: Emission of structured morphology JSON at defined “bottleneck” points grants auditability of intermediate concepts, exposing model reasoning for inspection.
- Gated Morphology Reward: The morphology reward is contingent on hierarchical (diagnosis-level) correctness, preventing optimization toward plausible but incorrect intermediate representations.
- Format Checking: penalizes unstructured, incomplete, or malformed outputs, ensuring audit trails are parseable and semantically faithful.
- PMI-Weighted Tversky: The use of pointwise mutual information (PMI) for feature weights in assigns greater importance to rare but diagnostically salient attributes, sharpening morphological reasoning.
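The format-checking constraint can be illustrated with a small validator. This is a sketch under stated assumptions: the tag names `morph` and `final_diagnosis` come from the source, but the XML-style `<tag>...</tag>` wrapping, the binary 0/1 scoring, and the requirement that `<morph>` contain a JSON object are illustrative choices:

```python
import json
import re

REQUIRED_TAGS = ("morph", "final_diagnosis")

def format_reward(output: str) -> float:
    """R_fmt sketch: 1.0 iff both tags are present and <morph> holds a JSON object."""
    for tag in REQUIRED_TAGS:
        if not re.search(rf"<{tag}>(.*?)</{tag}>", output, re.S):
            return 0.0  # missing or unclosed tag -> unparseable audit trail
    morph_body = re.search(r"<morph>(.*?)</morph>", output, re.S).group(1)
    try:
        parsed = json.loads(morph_body)
    except json.JSONDecodeError:
        return 0.0      # malformed morphology JSON
    return 1.0 if isinstance(parsed, dict) else 0.0
```

Because $R_{\text{fmt}}$ is all-or-nothing here, a single unclosed tag or trailing comma zeroes the term, pushing the policy toward strictly parseable intermediate representations.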
6. Empirical Evaluation and Analysis
6.1 Ablation Study
| Setting | T1.1 | T1.2 | T3.1 | T3.2 |
|---|---|---|---|---|
| SFT only | 41.74 | 49.11 | 62.57 | 63.34 |
| GRPO (acc+fmt) | 35.13 | 41.20 | 61.34 | 59.88 |
| w/o $R_{\text{tax}}$ | 39.65 | 48.09 | 65.40 | 65.27 |
| w/o $R_{\text{morph}}$ | 42.59 | 50.11 | 63.96 | 65.02 |
| w/o gating ($g \equiv 1$) | 43.26 | 56.03 | 66.71 | 63.89 |
| PMI → uniform weights | 42.56 | 56.98 | 57.32 | 56.64 |
| Full MAVIC | 43.93 | 59.29 | 66.04 | 65.48 |
- Naïve RL (acc+fmt) reduces reasoning quality.
- Incorporating either $R_{\text{tax}}$ or $R_{\text{morph}}$ alone provides limited gains.
- Only the complete scheme (with gating and PMI weighting) yields robust, superior performance.
6.2 DermoBench Results (Selected)
- In-distribution diagnosis avg: SFT 77.25% → RL 79.12% → RL+CCT 79.12%
- OOD diagnosis avg: SFT 64.19% → RL 65.27% → RL+CCT 66.48%
- Reasoning axis (T3 avg): SFT 62.95% → RL 65.48% → RL+CCT 67.19%
- Morphology JSON (T1.2): SFT 49.11% → RL 59.29% → RL+CCT 60.33%
6.3 Qualitative Example
Whereas general-purpose MLLMs such as Gemini-2.5-Flash hallucinated features not present in benign lesions (e.g., "blue-white veil"), DermoGPT-RL omitted extraneous descriptors and aligned its reasoning to the visual evidence.
7. Limitations and Future Research Directions
- Human–AI Gap: DermoGPT-RL+CCT remains approximately 20–30 points below expert dermatologists on open-ended tasks, highlighting the challenge of capturing subtle visual nuance and reasoning.
- Reward Design Complexity: MAVIC's dependence on explicit ontology, PMI tables, batch gating, and format validation complicates extension to other specialties.
- Scalability: The group-trajectory approach (G=8) is resource-intensive; off-policy or value-based RL methods may offer efficiency improvements.
- Multi-Turn Reasoning: Present training is restricted to single-turn chains-of-thought. Adapting MAVIC for interactive, multi-turn clinical Q&A is an open area.
- Dynamic Reward Weighting: Fixed reward weights and gating parameters may not be optimal for every condition; meta-learning or bandit algorithms could optimize these weights dynamically.
The MAVIC framework thus marks a transition in medical MLLM training—from terminal-only accuracy rewards toward full inference-path supervision—improving transparency, interpretability, and morphology-grounded diagnostic robustness in high-stakes clinical AI (Ru et al., 5 Jan 2026).