
MedAD-38K Benchmark Overview

Updated 8 February 2026
  • MedAD-38K is a comprehensive benchmark integrating multi-center medical images and structured VQA with stepwise, expert-verified diagnostic reasoning.
  • The dataset spans 38K images across 10 modalities and 10 anatomical regions, enabling evaluation of tasks like anomaly detection, modality classification, and lesion localization.
  • A two-stage training paradigm combining supervised fine-tuning and reinforcement learning with consistency rewards yields state-of-the-art, interpretable clinical decision support.

MedAD-38K is a large-scale, multi-modal, and multi-center benchmark designed for Medical Anomaly Detection (MedAD) with an emphasis on structured diagnosis reasoning. It uniquely pairs standardized Visual Question Answering (VQA) tasks with stepwise, expert-verified Chain-of-Thought (CoT) diagnostic rationales, enabling the assessment and training of large multimodal models (LMMs) on both answer accuracy and reasoning consistency. MedAD-38K underpins the evaluation and training protocols for models such as MedAD-R1, which are optimized for interpretable and logically coherent clinical decision support (Zhang et al., 1 Feb 2026).

1. Dataset Composition and Annotation Structure

MedAD-38K comprises approximately 38,000 medical images sourced from 14 publicly available datasets—including BraTS2021, LiverCT, Mosmed, ISIC2018, ChestX-ray8, and BACH—encompassing diverse acquisition protocols and de-identified clinical cohorts. The annotation framework covers 10 medical imaging modalities (MRI, CT, OCT, Ultrasound, Dermoscopy, CE-MRI, Endoscopy, Fundus photography, X-ray, Microscopy) across 10 anatomical regions (Brain, Liver, Retina, Breast, Skin, Lung, Thyroid, Alimentary tract, Chest, Lymph node).

Images are split into training (70%) and test (30%) sets, with no separate validation split detailed; a held-out test set is used for final performance evaluation.
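The 70/30 partition can be sketched as a seeded shuffle-and-split (the exact splitting procedure and any stratification by source dataset or modality are assumptions here, not stated in the benchmark description):

```python
import random

def split_dataset(items, train_frac=0.7, seed=0):
    # Shuffle deterministically, then cut at the train fraction.
    # MedAD-38K's actual split may be stratified; this is illustrative only.
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

train_ids, test_ids = split_dataset(range(38000))
# len(train_ids) == 26600, len(test_ids) == 11400
```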

Annotation Types

MedAD-38K integrates two principal annotation modalities:

  • Structured VQA pairs: Each case is accompanied by five core diagnostic axes—Anatomy Identification, Modality Classification, Anomaly Detection, Pathology Characterization, and Lesion Localization. Each axis uses multiple-choice (four options) formats, with 10 paraphrased question variants per axis to augment linguistic robustness.
  • Diagnostic CoT annotations: For each instance, a stepwise textual rationale links salient visual features to the diagnostic conclusion using a fixed output format: <think>…</think><answer>…</answer>. This format enforces transparent reasoning aligned with the final diagnostic label.

Annotation generation employs a semi-automated pipeline: initial visual descriptions are generated by MedGemma, VQA questions are formulated using template-based methods, and the Gemini 2.5 Pro LMM produces initial CoT explanations. All explanations undergo manual verification by clinical annotators who check for logical soundness, image grounding, and absence of hallucinations or circular reasoning. Instances failing these criteria are rejected or revised. No formal inter-annotator agreement or patient demographic breakdown is reported.
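Concretely, a single annotated case could be represented as follows. All field names and values here are hypothetical; only the five diagnostic axes, the four-option format, and the tagged CoT follow the description above:

```python
# Hypothetical record layout for one MedAD-38K case (illustrative only).
vqa_record = {
    "image_id": "brats2021_00042",  # hypothetical identifier
    "modality": "MRI",
    "region": "Brain",
    "questions": {
        # One entry per diagnostic axis; each axis has 10 paraphrased
        # question variants in the dataset, one of which is shown here.
        "anatomy_identification": {
            "text": "Which anatomical structure is shown?",
            "options": ["Brain", "Liver", "Lung", "Retina"],
            "answer": "Brain",
        },
        # ... modality_classification, anomaly_detection,
        #     pathology_characterization, lesion_localization ...
    },
    "cot": "<think>T2 hyperintensity with mass effect suggests ...</think>"
           "<answer>Glioblastoma</answer>",
}
```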

2. Benchmark Tasks and Evaluation Metrics

The benchmark framework defines five diagnostic tasks:

1. Anatomy Identification: Classify the anatomical structure.
2. Modality Classification: Identify the imaging modality.
3. Anomaly Detection: Binary classification (normal vs. abnormal).
4. Pathology Characterization: Multi-class lesion type recognition.
5. Lesion Localization: Predict the discrete spatial region (five-region grid).

A sixth aspect, Reasoning Coherence Evaluation, measures whether the generated CoT is logically sufficient for an external evaluator (LMM) to recover the same final answer; this signal is used internally during RL training.

Evaluation Metrics:

  • Primary metric: Accuracy (%) for each task and overall accuracy (total correct answers / total questions).
  • RL-specific metrics: During training, three reward components are used:
    • Format Reward R_fmt: Binary reward for presence of the correct XML tags.
    • Accuracy Reward R_acc: 1 if the generated answer matches the gold label, else 0.
    • Consistency Reward R_con: 1 if an external LMM, given the reasoning, produces the correct answer, else 0.
  • No BLEU, ROUGE, or F1 metrics are reported for final evaluation.
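A minimal sketch of the three reward checks, assuming simple tag parsing and a stand-in callable for the external judge LMM (both assumptions, not the paper's implementation):

```python
import re

def format_reward(output):
    # R_fmt: 1 if the output is exactly <think>...</think><answer>...</answer>.
    pattern = r"(?s)<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0

def accuracy_reward(output, gold):
    # R_acc: 1 if the extracted answer matches the gold label.
    m = re.search(r"<answer>(.*?)</answer>", output, re.S)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

def consistency_reward(reasoning, gold, judge):
    # R_con: 1 if a judge LMM, shown only the reasoning, recovers the
    # gold answer. `judge` is a stand-in callable in this sketch.
    return 1.0 if judge(reasoning) == gold else 0.0

out = "<think>Ring enhancement with edema.</think><answer>Glioblastoma</answer>"
reward = (format_reward(out)
          + accuracy_reward(out, "Glioblastoma")
          + consistency_reward("Ring enhancement with edema.",
                               "Glioblastoma",
                               judge=lambda r: "Glioblastoma")) / 3
# reward == 1.0
```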

3. Two-Stage Training Paradigm

MedAD-38K serves as the foundation for a two-stage training framework used in MedAD-R1, comprising supervised fine-tuning (Cognitive Injection) and reinforcement learning with consistency-driven objectives.

3.1 Cognitive Injection (Supervised Fine-Tuning)

The initial stage uses cross-entropy loss to teach foundational medical knowledge and enforce a “think-then-answer” output structure. Model output is strictly formatted as <think>…stepwise reasoning…</think><answer>…final diagnosis…</answer>, aligning with a human-readable diagnostic process.
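Composing an SFT target in this format is straightforward; a sketch (the helper name is hypothetical):

```python
def format_target(reasoning, answer):
    # Compose the "think-then-answer" target string used for SFT.
    return f"<think>{reasoning}</think><answer>{answer}</answer>"

target = format_target("Peritumoral edema and ring enhancement.", "Glioblastoma")
```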

3.2 Consistency Group Relative Policy Optimization (Con-GRPO)

The second stage applies RL, maximizing the expected reward J(θ) over group-sampled generations. The group-relative advantage scores each sampled output’s reward relative to the group average, using a group size G = 8. The core objective balances the per-token probability ratio and a KL penalty to discourage drift from the SFT-initialized policy, with clipping threshold ε = 0.1 and KL coefficient β = 0.04.

The reward signal is averaged across the three components (format, accuracy, consistency), directly incentivizing logically coherent, valid, and format-compliant reasoning chains.
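The group-relative advantage and the clipped, KL-penalized per-token objective can be sketched as follows. The mean/std normalization and the scalar form are simplifying assumptions; actual training vectorizes over tokens and the paper's exact normalization is not spelled out here:

```python
import math
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # Each sample's reward relative to the group mean; normalizing by the
    # group std is a common GRPO choice and an assumption in this sketch.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def con_grpo_token_loss(logp_new, logp_old, advantage, kl,
                        clip_eps=0.1, beta=0.04):
    # Clipped surrogate with a KL penalty toward the SFT-initialized policy.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return -(surrogate - beta * kl)

adv = group_relative_advantages([1.0, 1.0, 2/3, 1.0, 1/3, 1.0, 2/3, 1.0])  # G = 8
```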

4. MedAD-R1: Model Specifications and Experimental Protocol

The MedAD-R1 model employs a Qwen2.5-VL-3B (3B parameter) backbone featuring a multimodal architecture: a vision encoder projects image features into a cross-attended LLM embedding space, with an autoregressive LLM head generating both CoT and answer tags.

Training protocol and hyperparameters:

  • Stage 1: LoRA adapters, 2 epochs, learning rate 1 × 10⁻⁴, batch size 64, bfloat16 precision.
  • Stage 2: 1 epoch, learning rate 1 × 10⁻⁶, group size G = 8, β = 0.04 (KL penalty), implemented in PyTorch with the AdamW optimizer.

5. Benchmark Outcomes and Ablative Insights

5.1 Quantitative Results

MedAD-R1 demonstrates state-of-the-art (SOTA) performance across all five diagnostic axes on MedAD-38K, substantially surpassing both comparably sized and much larger LMMs (up to 72B parameters):

| Task | MedAD-R1 (%) | Best Baseline (%) | Δ (pp) |
|------|--------------|-------------------|--------|
| Anatomy Identification | 98.87 | 96.18 (Lingshu) | +2.69 |
| Anomaly Detection | 78.24 | 59.94 (Grok4) | +18.30 |
| Lesion Localization | 55.90 | 37.46 (Qwen72B) | +18.44 |
| Modality Classification | 97.14 | 96.59 (GLM-4.1V) | +0.55 |
| Pathology Characterization | 79.49 | 74.63 (MedVLM) | +4.86 |
| Overall Accuracy | 85.15 | 77.00 (Grok4) | +8.15 |

Compared to an LMM with the same backbone, the MedAD-R1 training approach yields a +13.74 percentage-point gain in overall accuracy.

5.2 Ablation Studies

  • Paradigm Ablation: RL-only yields 73.22%; SFT-only yields 75.41%; SFT+GRPO (accuracy reward only) reaches 78.85%; SFT+GRPO (consistency reward only) achieves 81.73%; full (balanced) MedAD-R1 achieves 85.15%.
  • Reward Weighting: Format-focused (λ_fmt=0.8) gives 77.13%; accuracy-focused (λ_acc=0.8) 82.54%; consistency-focused (λ_con=0.8) 84.21%; balanced (λ_fmt=λ_acc=λ_con=1/3) 85.15%.

This suggests that multi-objective reward balancing and explicit consistency incentivization both play crucial roles.
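The weighting ablation corresponds to replacing the plain average with a convex combination of the three reward components. The λ values for the non-emphasized terms are not reported; equal remainder weights are assumed in this sketch:

```python
def weighted_reward(r_fmt, r_acc, r_con, lam=(1/3, 1/3, 1/3)):
    # Convex combination of the three components; the balanced setting
    # lam = (1/3, 1/3, 1/3) performed best in the ablation.
    l_fmt, l_acc, l_con = lam
    return l_fmt * r_fmt + l_acc * r_acc + l_con * r_con

balanced = weighted_reward(1.0, 1.0, 0.0)  # format and accuracy satisfied only
acc_focused = weighted_reward(1.0, 1.0, 0.0, lam=(0.1, 0.8, 0.1))  # assumed remainder split
```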

5.3 Qualitative Examples

Qualitative assessment shows that MedAD-R1 produces verifiable, stepwise diagnostic chains, e.g., “T2-weighted image shows peritumoral edema ... contrast enhancement pattern ...” concluding with <answer>Glioblastoma</answer>. External LMM evaluators consistently recover the final answer from the reasoning chain alone, evidencing logical coherence.

6. Significance, Strengths, and Limitations

MedAD-38K constitutes the first benchmark at scale to merge structured VQA and verified diagnostic CoT in a multi-modal MedAD context. The benchmark, along with the Con-GRPO RL paradigm, enables models to achieve high performance not just in answer accuracy but in the transparency and logical verifiability of their predictions.

Key Strengths:

  • Large, heterogeneous, and clinically validated multimodal dataset
  • Consistency-driven RL yields SOTA results, particularly for complex reasoning (anomaly detection, localization)
  • Verifiable, transparent reasoning chains support clinical trust and AI auditability
  • Training efficiency (3B parameters) makes deployment feasible in real-world clinical workflows

Limitations & Open Directions:

  • Patient demographics and center-level metadata are not reported, potentially limiting downstream bias analysis
  • Evaluation of consistency relies on an external LMM; future development may explore internalized logical consistency modules
  • Lesion localization performance remains modest, indicating the need for finer spatial modeling or pixel-wise labels
  • Absence of inter-annotator agreement metrics; future versions may report kappa statistics for annotation quality control

A plausible implication is that benchmarks following this format could accelerate the development of trustworthy, interpretable medical AI systems capable of transparent anomaly detection and structured diagnostic reasoning (Zhang et al., 1 Feb 2026).
