MAVIC RL: Morphology-Consistent Medical AI
- The paper introduces MAVIC RL, a novel reinforcement learning framework that decomposes the reward into diagnosis accuracy, taxonomy alignment, morphology agreement, and format validity.
- The methodology employs a group relative policy-gradient loss with a gated morphology reward to ensure model predictions align with explicit morphological features and audit trails.
- Empirical evaluations on DermoBench show that MAVIC RL improves diagnostic consistency and reasoning quality while mitigating shortcutting and hallucinations.
Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning (RL) is a domain-adapted RL objective developed for multimodal LLMs (MLLMs) in dermatological reasoning, introduced in the context of the DermoGPT system. MAVIC fundamentally reframes the RL landscape in medical AI from reward structures focused solely on end-task accuracy to those enforcing verifiable consistency across the full diagnostic workflow: morphological observation, structured description, clinical reasoning, and diagnosis. This is accomplished through a multidimensional, morphology-anchored reward mechanism that penalizes shortcutting and incentivizes interpretable, auditable diagnostic chains grounded in explicit intermediate representations (Ru et al., 5 Jan 2026).
1. Motivation and Distinction from Standard RL
Conventional MLLM RL objectives in the medical domain often prioritize the final diagnostic label, neglecting the intermediate steps typical of specialist workflows. This can lead to models exploiting spurious correlations, omitting morphological assessments, and generating hallucinated rationales. Clinical experts, conversely, rely on a serial pipeline: visual assessment, structured morphological description, reasoned interpretation, and ultimately diagnostic conclusion. MAVIC addresses this by decomposing the reward into accuracy (diagnosis), taxonomy-level alignment (ontology closeness), morphology agreement (feature-level JSON structure), and format validity (output well-formedness). By anchoring the reward to a "concept bottleneck" such as Derm7pt or SkinCon JSON, MAVIC enforces a morphology-first chain of inference, mitigating ungrounded predictions and enabling explicit auditability.
2. Formal RL Components and Objective
2.1 State, Actions, and Rollout
- State ($s_t$): composed of visual features $v$ extracted by a frozen vision tower, the context/instruction $c$, and the previously generated output tokens $y_{<t}$.
- Action ($a_t$): emission of the next token $y_t$, which may belong to the structured morphology JSON, intermediate natural-language reasoning, or the final diagnosis.
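As a minimal sketch of this state/action formulation (names and types are illustrative, not the paper's implementation), the rollout state can be modeled as an immutable record that grows by one token per action:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RolloutState:
    """State s_t: frozen visual features, instruction context, generated tokens."""
    visual_features: tuple      # stand-in for the frozen vision-tower output
    context: str                # instruction / prompt
    tokens: tuple = field(default_factory=tuple)  # previously emitted tokens y_<t

    def step(self, token: str) -> "RolloutState":
        """Action a_t = emit `token`; returns the successor state s_{t+1}."""
        return RolloutState(self.visual_features, self.context,
                            self.tokens + (token,))

# A rollout alternates morphology-JSON tokens, reasoning, and diagnosis tokens.
s = RolloutState(visual_features=(0.1, 0.7), context="Describe the lesion.")
s = s.step("<morph>").step('{"asymmetry":')
```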
2.2 Reward Decomposition
For each rollout $\tau$, MAVIC computes four reward terms:
- $R_{\text{acc}}$: final-diagnosis accuracy.
- $R_{\text{tax}}$: Wu–Palmer taxonomy-based similarity between predicted and ground-truth diagnoses in the disease ontology,
$$R_{\text{tax}} = \frac{2\,\operatorname{depth}(\operatorname{LCS}(d_{\text{pred}}, d_{\text{gt}}))}{\operatorname{depth}(d_{\text{pred}}) + \operatorname{depth}(d_{\text{gt}})}.$$
- $R_{\text{morph}}$: PMI-weighted Tversky index over predicted vs. ground-truth morphological JSON. Given predicted feature set $P$ and ground-truth feature set $T$, weighted featurewise by PMI weights $w_f$,
$$R_{\text{morph}} = \frac{\sum_{f \in P \cap T} w_f}{\sum_{f \in P \cap T} w_f + \alpha \sum_{f \in P \setminus T} w_f + \beta \sum_{f \in T \setminus P} w_f}.$$
- $R_{\text{fmt}}$: format validity (presence and well-formedness of tags such as `morph` and `final_diagnosis`).
Gating function: to prevent degenerate solutions (e.g., outputting plausible morphologies for wrong diagnoses), $R_{\text{morph}}$ is "unlocked" only when $R_{\text{tax}}$ exceeds the batch median $\tilde{R}_{\text{tax}}$:
$$g = \sigma\!\big(\kappa\,(R_{\text{tax}} - \tilde{R}_{\text{tax}})\big),$$
where $\sigma$ denotes the sigmoid function and $\kappa$ a sharpness constant.
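The three non-trivial components above can be sketched in a few lines of plain Python. This is a hedged illustration, not the paper's code: `kappa`, the Tversky `alpha`/`beta`, and the PMI default of 1.0 are assumed values, and the taxonomy depths are passed in rather than computed from a real ontology.

```python
import math

def wu_palmer(depth_lcs: int, depth_a: int, depth_b: int) -> float:
    """Wu-Palmer similarity: 2*depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth_lcs / (depth_a + depth_b)

def pmi_tversky(pred: set, truth: set, pmi: dict,
                alpha: float = 0.5, beta: float = 0.5) -> float:
    """PMI-weighted Tversky index over morphology feature sets.

    Features missing from the PMI table fall back to weight 1.0 (assumption).
    """
    w = lambda feats: sum(pmi.get(f, 1.0) for f in feats)
    common = w(pred & truth)
    return common / (common + alpha * w(pred - truth)
                     + beta * w(truth - pred) + 1e-9)

def gate(r_tax: float, batch_median: float, kappa: float = 10.0) -> float:
    """Sigmoid gate: morphology reward unlocks when R_tax beats the batch median."""
    return 1.0 / (1.0 + math.exp(-kappa * (r_tax - batch_median)))
```

Rare features with high PMI dominate the Tversky sums, so a model that omits a rare but diagnostically salient attribute pays a proportionally larger penalty than one that omits a common attribute.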
2.3 Aggregate MAVIC Reward
Total per-rollout reward:
$$R(\tau) = w_{\text{acc}} R_{\text{acc}} + w_{\text{tax}} R_{\text{tax}} + g \cdot w_{\text{morph}} R_{\text{morph}} + w_{\text{fmt}} R_{\text{fmt}}.$$
The weights $w_{\text{acc}}, w_{\text{tax}}, w_{\text{morph}}, w_{\text{fmt}}$ are fixed default hyperparameters.
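The aggregation is a gated weighted sum. In the sketch below the weight values are illustrative placeholders, not the paper's defaults:

```python
def mavic_reward(r_acc: float, r_tax: float, r_morph: float, r_fmt: float,
                 g: float,
                 w_acc: float = 1.0, w_tax: float = 0.5,
                 w_morph: float = 0.5, w_fmt: float = 0.2) -> float:
    """Total per-rollout MAVIC reward; the morphology term is scaled by gate g."""
    return w_acc * r_acc + w_tax * r_tax + g * w_morph * r_morph + w_fmt * r_fmt
```

With the gate closed ($g \approx 0$), a rollout earns nothing for its morphology JSON regardless of how plausible it looks, which is exactly the anti-shortcutting behavior described above.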
2.4 Group Relative Policy-Gradient (GRPO) Loss
Let $\pi_\theta$ be the policy, $B$ the batch size, and $G$ the number of trajectories sampled per data point. The baseline for each datum $i$ is the average reward over its rollouts:
$$b_i = \frac{1}{G} \sum_{g=1}^{G} R(\tau_i^g).$$
The loss minimized is:
$$\mathcal{L}_{\text{RL}} = -\frac{1}{BG} \sum_{i=1}^{B} \sum_{g=1}^{G} \frac{\pi_\theta(\tau_i^g)}{\pi_{\text{old}}(\tau_i^g)} \big(R(\tau_i^g) - b_i\big) + \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),$$
where $\beta$ is the KL-penalty coefficient and $\pi_{\text{ref}}$ is the policy at the start of RL.
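A minimal, dependency-free sketch of this loss for one datum's group of rollouts follows. It assumes each `logp_*` entry is the total log-probability of a trajectory under the corresponding policy, and uses a simple per-sample KL estimate; the `beta` value is an assumption:

```python
import math

def grpo_loss(rewards, logp_new, logp_old, logp_ref, beta=0.02):
    """Group-relative policy-gradient loss over one datum's G rollouts.

    rewards, logp_new, logp_old, logp_ref: length-G sequences.
    Baseline = mean group reward; advantage = reward - baseline.
    """
    G = len(rewards)
    baseline = sum(rewards) / G
    loss = 0.0
    for R, ln, lo, lr in zip(rewards, logp_new, logp_old, logp_ref):
        ratio = math.exp(ln - lo)   # importance ratio pi_theta / pi_old
        advantage = R - baseline    # group-relative advantage
        kl = ln - lr                # crude per-sample KL-to-reference estimate
        loss += -(ratio * advantage) + beta * kl
    return loss / G
```

Because the baseline is computed within each group, no learned value function is needed; rollouts better than their siblings get positive advantage and are reinforced.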
3. Fine-Tuning Algorithm and Hyperparameters
Pseudocode Summary
```
initialize θ ← SFT checkpoint
for epoch in 1..1:
    for minibatch {(image_i, instr_i, y_i)}:
        for i in 1..N:
            v_i = vision_tower(image_i)
            c_i = construct_prompt(instr_i)
            τ^1_i, ..., τ^G_i ~ π_θ(·|v_i, c_i)
            for each g:
                extract diagnosis, morph-JSON from τ^g_i
                compute component rewards
        compute baselines
        compute L_RL and update LoRA params (AdamW, lr=1e-6)
```
Key Hyperparameters
- Group size $G = 8$
- Generation: temperature=1.0, top-p=1.0, top-k=50
- LoRA (second stage): rank=16, dropout=0.05
- Learning rate: 1e-6, cosine schedule with 3% warmup
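The generation settings above combine temperature, top-k, and top-p (nucleus) filtering. A self-contained sketch of that sampling scheme, operating on a `{token: logit}` dict for clarity (real implementations work on tensors), is:

```python
import math
import random

def sample_token(logits: dict, temperature: float = 1.0,
                 top_k: int = 50, top_p: float = 1.0) -> str:
    """Sample one token with temperature, top-k, then top-p filtering."""
    # Top-k: keep only the k highest-scoring tokens.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax (shifted by the max logit for stability).
    m = max(v for _, v in items)
    probs = [(t, math.exp((v - m) / temperature)) for t, v in items]
    z = sum(p for _, p in probs)
    probs = [(t, p / z) for t, p in probs]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw.
    z = sum(p for _, p in kept)
    r, acc = random.random() * z, 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]
```

Note that with temperature=1.0 and top-p=1.0 (the listed defaults), only the top-k=50 truncation actually constrains sampling.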
4. Training Integration and Workflow
The complete DermoGPT-RL training sequence follows a two-stage paradigm:
- Supervised Fine-Tuning (SFT):
- Data: 646,000 DermoInstruct pairs, spanning free-text, JSON, CoT, flat diagnosis, and hierarchy formats.
- Loss: Cross-entropy.
- Train the vision and merger modules with the LLM frozen; LoRA rank=64.
- MAVIC RL:
- Initialize from SFT checkpoint.
- Freeze vision, merger, LLM; update only LoRA adapters (rank=16).
- Optimize MAVIC objective for one epoch.
The final model, DermoGPT-RL, can be further adapted at inference time using a Confidence-Consistency Test-time adaptation (CCT) mechanism for improved robustness.
5. Consistency Enforcement and Concept Bottlenecks
MAVIC enforces explicit consistency constraints across the inference pipeline:
- Concept Bottleneck: Emission of structured morphology JSON at defined “bottleneck” points grants auditability of intermediate concepts, exposing model reasoning for inspection.
- Gated Morphology Reward: The morphology reward is contingent on hierarchical (diagnosis-level) correctness, preventing optimization toward plausible but incorrect intermediate representations.
- Format Checking: penalizes unstructured, incomplete, or malformed outputs, ensuring audit trails are parseable and semantically faithful.
- PMI-Weighted Tversky: The use of pointwise mutual information (PMI) for feature weights in assigns greater importance to rare but diagnostically salient attributes, sharpening morphological reasoning.
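The format-checking constraint can be illustrated with a small validator. This is a sketch under stated assumptions: the tag names `morph` and `final_diagnosis` come from the source, but the XML-style `<tag>...</tag>` wrapping, the binary 0/1 scoring, and the requirement that `<morph>` contain a JSON object are illustrative choices:

```python
import json
import re

REQUIRED_TAGS = ("morph", "final_diagnosis")

def format_reward(output: str) -> float:
    """R_fmt sketch: 1.0 iff both tags are present and <morph> holds a JSON object."""
    for tag in REQUIRED_TAGS:
        if not re.search(rf"<{tag}>(.*?)</{tag}>", output, re.S):
            return 0.0  # missing or unclosed tag -> unparseable audit trail
    morph_body = re.search(r"<morph>(.*?)</morph>", output, re.S).group(1)
    try:
        parsed = json.loads(morph_body)
    except json.JSONDecodeError:
        return 0.0      # malformed morphology JSON
    return 1.0 if isinstance(parsed, dict) else 0.0
```

Because $R_{\text{fmt}}$ is all-or-nothing here, a single unclosed tag or trailing comma zeroes the term, pushing the policy toward strictly parseable intermediate representations.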
6. Empirical Evaluation and Analysis
6.1 Ablation Study
| Setting | T1.1 | T1.2 | T3.1 | T3.2 |
|---|---|---|---|---|
| SFT only | 41.74 | 49.11 | 62.57 | 63.34 |
| GRPO (acc+fmt) | 35.13 | 41.20 | 61.34 | 59.88 |
| w/o $R_{\text{tax}}$ | 39.65 | 48.09 | 65.40 | 65.27 |
| w/o $R_{\text{morph}}$ | 42.59 | 50.11 | 63.96 | 65.02 |
| w/o gating ($g \equiv 1$) | 43.26 | 56.03 | 66.71 | 63.89 |
| PMI → uniform weights | 42.56 | 56.98 | 57.32 | 56.64 |
| Full MAVIC | 43.93 | 59.29 | 66.04 | 65.48 |
- Naïve RL (acc+fmt) reduces reasoning quality.
- Incorporating either $R_{\text{tax}}$ or $R_{\text{morph}}$ alone provides limited gains.
- Only the complete scheme (with gating and PMI weighting) yields robust, superior performance.
6.2 DermoBench Results (Selected)
- In-distribution diagnosis avg: SFT 77.25% → RL 79.12% → RL+CCT 79.12%
- OOD diagnosis avg: SFT 64.19% → RL 65.27% → RL+CCT 66.48%
- Reasoning axis (T3 avg): SFT 62.95% → RL 65.48% → RL+CCT 67.19%
- Morphology JSON (T1.2): SFT 49.11% → RL 59.29% → RL+CCT 60.33%
6.3 Qualitative Example
Whereas general-purpose MLLMs such as Gemini-2.5-Flash hallucinated features not present in benign lesions (e.g., "blue-white veil"), DermoGPT-RL omitted extraneous descriptors and aligned its reasoning to the visual evidence.
7. Limitations and Future Research Directions
- Human–AI Gap: DermoGPT-RL+CCT remains approximately 20–30 points below expert dermatologists on open-ended tasks, highlighting the challenge of capturing subtle visual nuance and reasoning.
- Reward Design Complexity: MAVIC's dependence on explicit ontology, PMI tables, batch gating, and format validation complicates extension to other specialties.
- Scalability: The group-trajectory approach (G=8) is resource-intensive; off-policy or value-based RL methods may offer efficiency improvements.
- Multi-Turn Reasoning: Present training is restricted to single-turn chains-of-thought. Adapting MAVIC for interactive, multi-turn clinical Q&A is an open area.
- Dynamic Reward Weighting: Fixed reward weights and gating parameters may not be optimal for every condition; meta-learning or bandit algorithms could optimize these weights dynamically.
The MAVIC framework thus marks a transition in medical MLLM training—from terminal-only accuracy rewards toward full inference-path supervision—improving transparency, interpretability, and morphology-grounded diagnostic robustness in high-stakes clinical AI (Ru et al., 5 Jan 2026).