DermoGPT Benchmark for Dermatology MLLMs
- DermoGPT Benchmark is a clinically driven evaluation framework that advances dermatology MLLMs by mirroring the complete morphology, reasoning, and diagnosis workflow.
- It consolidates multi-modal tasks with expert-curated datasets and hierarchical annotation schemas to improve both diagnostic accuracy and interpretability.
- The framework emphasizes fairness and reproducibility, assessing model performance across diverse skin types using both closed-ended metrics and open-ended narrative evaluations.
The DermoGPT Benchmark refers to a family of rigorous, clinically grounded diagnostic evaluation suites for dermatology multimodal LLMs (MLLMs), consolidated under the name “DermoBench” and tightly integrated with the training and assessment protocols used in DermoGPT and related systems. These benchmarks are designed to shift dermatology machine learning from conventional lesion-level classification toward morphology-anchored, multi-step reasoning. DermoBench and its derivatives provide multi-axis, expert-verified evaluation frameworks that test model competence across the full diagnostic workflow, from granular visual description through inferential clinical reasoning to final diagnosis, with an explicit focus on fairness, reproducibility, and fidelity to dermatological expertise (Yilmaz et al., 20 Jan 2026; Shen et al., 12 Nov 2025; Shen et al., 19 Nov 2025; Ru et al., 5 Jan 2026).
1. Clinical Motivation and Benchmark Evolution
Conventional dermatology AI systems have largely been evaluated on image-level recognition, typically via single-label or multi-label classification of lesion types. This approach is insufficient for the comprehensive assessment of MLLMs intended for clinical diagnostic support, as it ignores morphology parsing, language grounding, diagnostic reasoning, and narrative report generation. The DermoGPT Benchmark family addresses this gap by structuring evaluation around the complete “morphology → reasoning → diagnosis” pipeline that mirrors expert workflows (Ru et al., 5 Jan 2026).
The earliest foundations of this paradigm, represented by DermaBench (Yilmaz et al., 20 Jan 2026), introduced VQA-style (visual question answering) protocols built on rich, expert-curated annotations; subsequent efforts such as those in DermoGPT (Ru et al., 5 Jan 2026), DermBench (Shen et al., 12 Nov 2025), and the SkinGPT-R1 assessment suite (Shen et al., 19 Nov 2025) generalized these concepts into large-scale, multi-task and multi-axis benchmarks. Each iteration increased dataset scale, annotation richness, and the diversity of both tasks and metrics.
2. Dataset Composition, Annotation, and Schema
Data Sources
- DermaBench (Yilmaz et al., 20 Jan 2026): 656 clinical and dermoscopic images from 570 unique patients, curated for diversity in skin tone (Fitzpatrick I–VI: 10–22% per type) and anatomic site (≥10 distinct body regions).
- DermBench (Shen et al., 12 Nov 2025): 4,000 dermatology photos, each paired with an expert-verified diagnostic chain-of-thought narrative.
- DermoBench (Ru et al., 5 Jan 2026): 33,999 VQA pairs from over 900 held-out images, spanning 11 subtasks and four clinical axes.
- SkinGPT-R1 certified test split (Shen et al., 19 Nov 2025): 3,000 dermatologist-audited cases.
Annotation Schema
- Hierarchical VQA (DermaBench):
  - 22 main questions (single-choice, multi-select, open-ended) partitioned into:
    - Fundamental image/meta (modality, quality, Fitzpatrick type)
    - Detailed dermatologic VQA (anatomic site, lesion count, morphology, border, surface, color, size, distribution, primary/secondary morphologies)
    - Integrative narrative summary (structured clinical report)
- Diagnostic Narratives (DermBench):
  - Structured chain-of-thought (CoT) narratives using prompts that elicit observations, differential assessment, and a reasoned conclusion.
- DermoBench (DermoGPT):
  - 11 subtasks across morphology, diagnosis, reasoning, and fairness, including both closed-ended MCQA (multiple-choice question answering) and open-ended report and reasoning generation, with explicit <morph> JSON schemas, step-wise reasoning tags, and disentangled attributes for morphologically grounded evaluation (Ru et al., 5 Jan 2026).
Quality control is enforced by double annotation, adjudication, exclusion of irreconcilable/poor quality samples, and in the large-scale DermoBench, by line-by-line expert revision of all open-ended references.
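The exact field inventory of the <morph> schema is not spelled out here, but a minimal sketch of what a disentangled morphology record and its structural check might look like is below. All field names and values are illustrative assumptions, not the actual DermoBench schema:

```python
# Hypothetical <morph>-style record check. The field names below are
# illustrative assumptions, not the actual DermoBench schema.
from typing import Dict, List

REQUIRED_FIELDS = ("primary_morphology", "color", "border", "distribution")

def validate_morph(record: Dict) -> List[str]:
    """Return a list of structural problems found in a candidate morphology record."""
    # Because attributes are disentangled, each field can be scored
    # independently; a missing attribute is reported without invalidating
    # the attributes that are present.
    return [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]

record = {
    "primary_morphology": "plaque",
    "color": "erythematous",
    "border": "well-demarcated",
}
problems = validate_morph(record)  # only "distribution" is absent here
```

Attribute-wise validation of this kind also makes per-attribute accuracy straightforward to compute, which matches the benchmark's emphasis on morphologically grounded scoring.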
3. Evaluation Protocols, Metrics, and Fairness
Task Structure and Metrics
Evaluation in the DermoGPT Benchmark is multi-modal and axis-aligned:
- Image attribute/extraction (VQA):
  - Metrics: accuracy (single-choice), macro/micro F₁ (multi-label), and set-similarity/text-overlap measures (e.g., Tversky index, BLEU, BERTScore) (Yilmaz et al., 20 Jan 2026; Ru et al., 5 Jan 2026).
- Diagnostic chain-of-thought and narratives:
  - Open-ended tasks are assessed via an LLM-as-a-Judge paradigm: each gold narrative is decomposed into atomic claims, and candidate texts are scored for support/contradiction/omission, yielding a recall-like score that penalizes hallucinations and unsupported statements (Ru et al., 5 Jan 2026).
- Multi-dimensional scoring (DermBench):
  - s = [Acc, Safe, MedG, Cover, Reason, Desc], each on a 1–5 scale: Accuracy (diagnosis match), Safety (absence of harmful advice), Medical Groundedness, Clinical Coverage, Reasoning Coherence, Description Precision (Shen et al., 12 Nov 2025; Shen et al., 19 Nov 2025).
  - System-level ranking is determined by mean dimension scores and mean deviation from the expert gold standard (see Table/Algorithm 1 in (Shen et al., 12 Nov 2025)).
- Fairness:
  - Accuracy is reported per Fitzpatrick type, with the fairness ratio defined as minₖ Accₖ / maxₖ Accₖ, quantifying performance disparities across skin tones (Ru et al., 5 Jan 2026).
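The aggregation rules above can be sketched in a few lines. The function names are illustrative, but the formulas (per-dimension means, mean absolute deviation from expert gold, and the min/max fairness ratio) follow the definitions given:

```python
# Sketch of DermBench-style score aggregation and the Fitzpatrick fairness
# ratio. Function names are illustrative; formulas follow the text.
from statistics import mean

DIMENSIONS = ["Acc", "Safe", "MedG", "Cover", "Reason", "Desc"]  # each on a 1-5 scale

def system_scores(case_scores, gold_scores):
    """Mean per-dimension score and mean absolute deviation from expert gold.

    case_scores / gold_scores: lists of dicts mapping dimension -> 1-5 rating.
    """
    means = {d: mean(c[d] for c in case_scores) for d in DIMENSIONS}
    deviation = mean(
        abs(c[d] - g[d])
        for c, g in zip(case_scores, gold_scores)
        for d in DIMENSIONS
    )
    return means, deviation

def fairness_ratio(acc_by_fitzpatrick):
    """min_k Acc_k / max_k Acc_k over Fitzpatrick skin types."""
    accs = list(acc_by_fitzpatrick.values())
    return min(accs) / max(accs)
```

A ratio of 1.0 indicates identical accuracy across skin types; values near the reported human ceiling (~0.94) indicate small residual disparities.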
Judge and Reference-Free Evaluation
- LLM-as-judge protocols provide claim-level, structured critique and scoring, designed to enforce reproducibility via fixed system prompts, seed control, and standardized rubrics (Shen et al., 12 Nov 2025, Ru et al., 5 Jan 2026).
- DermEval reference-free evaluator: a vision-language encoder models image–text compatibility, outputs per-dimension 1–5 ratings with narrative critiques, and is trained to regress against LLM-judge scores; an InfoNCE loss aligns image and text features for robust, reference-independent assessment (Shen et al., 12 Nov 2025; Shen et al., 19 Nov 2025).
Alignment with Human Performance
Expert dermatologists are included as a baseline for all open-ended tasks in DermoBench, providing clinical ceilings for model assessment (Ru et al., 5 Jan 2026).
4. Task Suite Coverage and Subtasks
DermoGPT Benchmarks comprise a comprehensive set of tasks that target the breadth of visual diagnostic reasoning in dermatology. The most advanced protocol (DermoBench in DermoGPT) includes:
| Clinical Axis | Sample Subtasks | Task Type |
|---|---|---|
| Morphology | Detailed/free-text description; <morph> JSON creation; MCQA | Open/closed |
| Diagnosis | 4-way/25-way MCQA; hierarchical ontology traversal; OOD MCQA | Closed |
| Reasoning | Chain-of-thought stepwise justification; morph-grounded CoT | Open-ended |
| Fairness | Accuracy stratified by Fitzpatrick types, per-site analysis | Derived |
Tasks are strictly isolated from training data, ensuring no data leakage. Each subtask is designed to probe not only recognition skill but also semiotic grounding, evidentiary reasoning, and robustness to phenotype/setting drifts (Ru et al., 5 Jan 2026, Yilmaz et al., 20 Jan 2026).
5. Baseline Systems and Performance
A broad range of models is evaluated:
- General-purpose MLLMs: GPT-4o-mini, Gemini-2.5-Flash, Qwen2.5-VL-72B, Llama-3.2, etc.
- Medical/dermatology-specialized models: HuatuoGPT-Vis-7B, LLaVA-Med-v1.5, SkinVL-PubMM, SkinGPT-R1, DermoGPT-RL.
- Human clinicians: provide performance bounds.
Key results across DermoGPT Benchmark (Ru et al., 5 Jan 2026):
| Model | Morph (T1 avg) | Diagnosis avg | Reasoning (T3 avg) | Fairness |
|---|---|---|---|---|
| Gemini-2.5-Flash | 49.9 | 63.4 | 53.7 | 79.9 |
| DermoGPT-RL + CCT | 59.8 | 78.0 | 67.2 | 93.9 |
| Human ceiling | 81.9 | 83.2 | 80.3 | 94.0 |
DermoGPT-RL + CCT achieves state-of-the-art performance, especially on fairness (ratio 93.9, approaching the human 94.0), substantially narrowing the human–AI gap in diagnostic accuracy, open-ended reasoning, and skin-type equity. Notably, a substantial gap remains in free-text morphology description (59.8 vs. 81.9), underscoring ongoing challenges for the field (Ru et al., 5 Jan 2026; Shen et al., 19 Nov 2025).
SkinGPT-R1’s introduction of dermatologist-aligned CoT supervision and vision-adapter distillation further improved open-ended narrative quality by 41% over prior Vision-R1 backbones, highlighting the leverage of domain-specific reasoning (Shen et al., 19 Nov 2025).
6. Extensibility, Best Practices, and Future Directions
DermoGPT Benchmarks are released as metadata-rich, extensible frameworks:
- Data Access and Licensing: Metadata only (e.g., DermaBench/Harvard Dataverse), with images in upstream datasets under controlled-access academic licenses (Yilmaz et al., 20 Jan 2026).
- Fairness Stress Testing: Built-in capacity to analyze performance stratified by skin type, anatomic site, and rare phenotype, supporting fairness audits and mitigating bias.
- Explainability and Interactive Probing: Advanced protocols suggest follow-up question design (e.g., next-step management, biopsy recommendation), tool use, explanation alignment (visual attribution), and chain-of-thought probes, all supporting transparent evaluation of reasoning and clinical safety (Yilmaz et al., 20 Jan 2026).
- Best Practices: Recommended practices include maintaining a balanced lesion taxonomy, reporting deviation and narrative critiques, re-calibrating evaluators (e.g., DermEval) for new test distributions, and ensuring strict reproducibility via fixed random seeds and prompt templates (Shen et al., 12 Nov 2025).
- Expansion: Future benchmark customizations aim to extend into additional diagnostic specialties, introduce scenario-based reasoning (e.g., temporal evolution, treatment simulation), and embed integration with clinical workflow software.
7. Significance and Ongoing Challenges
The DermoGPT Benchmark paradigm has advanced the evaluation of dermatology MLLMs beyond superficial classification, enabling systematic, granular assessment of models’ perceptual and inferential capabilities under realistic constraints. By enforcing clinical workflow emulation, open-ended justification, and multi-dimensional safety/accuracy assessment — all while supporting strict fairness and reproducibility — it provides a template for responsible medical AI benchmarking (Ru et al., 5 Jan 2026, Yilmaz et al., 20 Jan 2026, Shen et al., 12 Nov 2025, Shen et al., 19 Nov 2025).
Persistent challenges include attaining human-expert parity in granular morphology reporting, generalizing across rare phenotypes and imaging modalities, and operationalizing explainability in a clinically satisfactory manner. These benchmarks remain central to trustworthy benchmarking, validation, and ongoing development of robust, fair, and clinically-aligned dermatology MLLMs.