SkinGPT-R1: Dermatology Vision-Language Model
- The work introduces a domain-specialized, adapter-only framework that advances diagnostic chain-of-thought reasoning in clinical dermatology via dual distillation.
- It trains lightweight visual and language adapters atop a frozen multimodal backbone, efficiently combining specialized visual feature transfer with narrative supervision.
- It establishes a reproducible research foundation through standardized resources (DermCoT, DermEval, DermBench) and achieves state-of-the-art performance against strong baselines.
SkinGPT-R1 is a domain-specialized, adapter-only vision-LLM designed to advance explicit, verifiable diagnostic chain-of-thought (CoT) reasoning in clinical dermatology. It achieves state-of-the-art narrative reasoning performance through a dual distillation strategy that combines dermatology-specific visual feature transfer and high-quality chain-of-thought narrative supervision, while maintaining computational efficiency by training only lightweight adapters atop a frozen multimodal backbone. Its development introduces a standardized dermatological reasoning corpus (DermCoT), a robust six-dimensional clinician-aligned evaluation protocol (DermEval), and a large-scale benchmarking dataset (DermBench) to set a reproducible foundation for future research in vision-language reasoning within medical domains (Shen et al., 19 Nov 2025).
1. Model Architecture and Adapter Design
SkinGPT-R1 employs Vision-R1-7B as its frozen multimodal backbone, which integrates a Vision Transformer (ViT)-style visual encoder with an autoregressive language decoder. Domain adaptation and task transfer are achieved exclusively via two lightweight adapters inserted along the vision-to-language interface. The first, a visual alignment head, projects patch-level image features into a dermatology-specific teacher embedding space. The second, a low-rank language adapter, augments the language decoder by biasing vocabulary logits according to a distilled image summary projected into the decoder’s hidden space.
For an input image $x$, the ViT encoder produces patch embeddings $\{p_i\}_{i=1}^{N}$, which are aggregated into a summary vector $h$ using residual bottlenecks. The adapters compute two outputs:
- Student projection: $z^{\mathrm{stu}} = W_s h + b_s$, with $W_s \in \mathbb{R}^{d_t \times d}$ and $b_s \in \mathbb{R}^{d_t}$, where $d_t$ is the teacher embedding dimension.
- Image summary for language bias: $u = W_u h + b_u$, with $W_u \in \mathbb{R}^{d_h \times d}$ and $b_u \in \mathbb{R}^{d_h}$, where $d_h$ is the decoder hidden size. This summary is transformed via the frozen language head $W_{\mathrm{LM}}$ into a vocabulary bias $b_v = W_{\mathrm{LM}} u$ and added to the decoder logits with a scaling gate $\gamma$ on supervised positions.
All backbone parameters remain frozen; only the adapter weights are optimized. At inference, the adapter operations reduce to a static projection and a matrix addition, preserving the original backbone’s latency and memory footprint.
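A minimal PyTorch sketch of the two adapters under toy dimensions; the module names, the low-rank factorization and its rank, and the initial gate value are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VisualAlignmentHead(nn.Module):
    """Projects the pooled ViT summary h into the teacher embedding space."""
    def __init__(self, d_model: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_teacher)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)  # z_stu, matched to the teacher embedding via MSE

class LanguageBiasAdapter(nn.Module):
    """Low-rank adapter mapping the image summary to a vocabulary-logit bias."""
    def __init__(self, d_model: int, d_hidden: int, lm_head: nn.Linear, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # low-rank factorization
        self.up = nn.Linear(rank, d_hidden, bias=False)
        self.lm_head = lm_head                            # frozen backbone LM head
        for p in self.lm_head.parameters():
            p.requires_grad_(False)
        self.gamma = nn.Parameter(torch.tensor(0.1))      # scaling gate (assumed init)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        u = self.up(self.down(h))            # image summary in decoder hidden space
        return self.gamma * self.lm_head(u)  # bias b_v, added to decoder logits

# Usage with toy dimensions (the real backbone's sizes differ):
h = torch.randn(2, 256)                          # pooled ViT summaries
lm_head = nn.Linear(512, 1000, bias=False)       # stand-in for the frozen LM head
z_stu = VisualAlignmentHead(256, 128)(h)         # (2, 128)
b_v = LanguageBiasAdapter(256, 512, lm_head)(h)  # (2, 1000)
```

At inference, each module collapses to a fixed matrix product, consistent with the latency claim above.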
2. Dual Distillation and Training Objective
SkinGPT-R1’s training comprises two complementary distillation processes:
- Visual Distillation: A fixed, dermatologist-trained encoder provides dermatology-aware target embeddings $z^{\mathrm{tea}}$. The adapter output $z^{\mathrm{stu}}$ is trained to minimize the mean squared error (MSE) to the teacher embedding: $\mathcal{L}_{\mathrm{vis}} = \lVert z^{\mathrm{stu}} - z^{\mathrm{tea}} \rVert_2^2$.
- Chain-of-Thought Distillation: Supervised fine-tuning is performed on DermCoT narrative data using cross-entropy over next-token prediction, with the network logits modified by the vocabulary bias vector $b_v$. For supervised positions $t$, the logit update is $\ell_t \leftarrow \ell_t + \gamma\, m_t\, b_v$, with $m_t$ an indicator for the relevant tokens.
A weighted sum combines the two losses, with weight schedules $\lambda_{\mathrm{vis}}(\tau)$ and $\lambda_{\mathrm{CoT}}(\tau)$ controlled by a cosine ramp-up over normalized training progress $\tau \in [0, 1]$:

$$\mathcal{L} = \lambda_{\mathrm{CoT}}(\tau)\,\mathcal{L}_{\mathrm{CoT}} + \lambda_{\mathrm{vis}}(\tau)\,\mathcal{L}_{\mathrm{vis}}$$

This regime facilitates concurrent transfer of dermatology-specialized visual features and narrative diagnostic reasoning.
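A minimal sketch of the combined objective; the ramp endpoints and the choice to hold the CoT weight constant while the visual weight ramps up are assumptions about the schedule's shape:

```python
import math
import torch
import torch.nn.functional as F

def cosine_ramp(tau: float, start: float, end: float) -> float:
    """Cosine interpolation from start to end over normalized progress tau in [0, 1]."""
    return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * tau))

def dual_distillation_loss(z_stu, z_tea, logits, targets, tau: float):
    """Weighted sum of the visual-distillation MSE and the CoT cross-entropy."""
    l_vis = F.mse_loss(z_stu, z_tea)      # match dermatology teacher embeddings
    l_cot = F.cross_entropy(              # next-token prediction on DermCoT text
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,                # skip unsupervised positions
    )
    lam_vis = cosine_ramp(tau, 0.0, 1.0)  # assumed: visual term ramps up
    lam_cot = 1.0                         # assumed: CoT term held at full weight
    return lam_cot * l_cot + lam_vis * l_vis
```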
3. DermCoT Corpus and Standardized Supervisory Narratives
DermCoT is constructed from DermNet images annotated by domain experts. Generation of standardized chain-of-thought narratives follows a three-layer pipeline:
- Observation-Only Caption: Generated by a pretrained general vision-LLM (VLM) with no diagnostic content.
- Label-Aware, Diagnosis-Concluding Draft: Incorporating expert labels and diagnostic reasoning.
- Canonical Formatting and Normalization: Structured as:
- Layer 1: Visual findings (site, morphology, color, distribution)
- Layer 2: Evidence-based, differential reasoning explicitly supporting clinical separation
- Layer 3: Conclusive diagnosis with succinct clinical plan
The final corpus comprises 10,000 training narratives filtered by DermEval (mean score ≥ 4), balanced across diagnoses and anatomical sites, with a further 3,000 dermatologist-certified cases locked for benchmarking. This hierarchy ensures both clinical realism and standardization of step-wise logical reasoning.
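A hypothetical container mirroring this three-layer narrative structure; the field names, rendering format, and example text are illustrative, not the corpus's actual schema or content:

```python
from dataclasses import dataclass

@dataclass
class DermCoTNarrative:
    visual_findings: str          # Layer 1: site, morphology, color, distribution
    differential_reasoning: str   # Layer 2: evidence-based clinical separation
    conclusion: str               # Layer 3: diagnosis plus succinct plan

    def render(self) -> str:
        """Serialize the three layers in canonical order."""
        return "\n\n".join([
            f"Findings: {self.visual_findings}",
            f"Reasoning: {self.differential_reasoning}",
            f"Conclusion: {self.conclusion}",
        ])

# Illustrative (invented) example, showing the step-wise structure only:
example = DermCoTNarrative(
    visual_findings="Erythematous, sharply demarcated scaly plaques on extensor elbows.",
    differential_reasoning=(
        "Sharp demarcation and silvery scale favor psoriasis over "
        "nummular eczema, which is typically more exudative."
    ),
    conclusion="Plaque psoriasis; recommend topical corticosteroid and follow-up.",
)
print(example.render())
```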
4. Evaluation Protocols: DermEval and DermBench
Evaluation employs two components:
- DermEval: A fine-tuned LLaVA-based automated assessor outputs six-field structure-aligned scores and free-text critique for each image–narrative pair. It is trained using cross-entropy and REINFORCE on 15,000 dermatologist-annotated pairs to match physician ratings and is applied offline to curate high-quality DermCoT training cases.
- DermBench: Serves as the held-out test set of 3,000 dermatologically certified cases. For each image, 14 VLMs’ narratives are assessed under a fixed evaluator prompt, yielding scores across six dimensions: accuracy, safety, medical groundedness, clinical coverage, reasoning coherence, and description precision.
Scoring is computed as

$$\bar{s}_d = \frac{1}{N} \sum_{i=1}^{N} s_{i,d},$$

where $s_{i,d}$ denotes the dimension-$d$ score for case $i$ and $N$ is the number of cases.
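A minimal Python sketch of this per-dimension averaging, assuming scores arrive as a `{case_id: {dimension: score}}` mapping (an assumed layout, not DermEval's actual output schema):

```python
from collections import defaultdict

DIMENSIONS = ["accuracy", "safety", "groundedness",
              "coverage", "coherence", "description_precision"]

def dimension_means(scores: dict) -> dict:
    """Average each dimension over all cases; assumes every case has all six fields."""
    totals = defaultdict(float)
    for per_case in scores.values():
        for dim in DIMENSIONS:
            totals[dim] += per_case[dim]
    n = len(scores)
    means = {dim: totals[dim] / n for dim in DIMENSIONS}
    means["overall"] = sum(means[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return means

# e.g. dimension_means({"case_001": {"accuracy": 3.5, "safety": 4.2, ...}, ...})
```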
5. Experimental Results and Ablations
On DermBench, SkinGPT-R1 achieves a mean score of 4.031/5, ranking 1st among all evaluated models, a 41% relative improvement over Vision-R1 (2.865/5). Scores by dimension are: accuracy (3.476), safety (4.187), groundedness (3.459), coverage (4.403), coherence (4.026), description precision (4.637).
Zero-shot classification results demonstrate consistent performance gains over the Vision-R1 baseline:
| Model | DermBench Avg | Derm7pt (14 cls) | PAD-UFES-20 (7 cls) | ISIC (39 cls) |
|---|---|---|---|---|
| Vision-R1 | 2.865 | 27.3% | 31.7% | 7.0% |
| SkinGPT-R1 w/o visual distill | 3.404 | 29.6% | 33.4% | 7.3% |
| SkinGPT-R1 (full) | 4.031 | 32.9% | 37.6% | 8.6% |
The ablation studies reveal that DermCoT-based CoT supervision accounts for a +19% relative gain over the baseline on DermBench, and dermatology-aware visual distillation adds a further +18%, with consistent improvements across all classification tasks.
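These relative gains follow directly from the DermBench averages in the table above; a quick arithmetic check:

```python
# Relative improvements implied by the DermBench averages in the table above.
baseline, cot_only, full = 2.865, 3.404, 4.031
print(f"CoT supervision vs. baseline: +{(cot_only / baseline - 1) * 100:.0f}%")  # +19%
print(f"Visual distillation vs. CoT:  +{(full / cot_only - 1) * 100:.0f}%")      # +18%
print(f"Full model vs. baseline:      +{(full / baseline - 1) * 100:.0f}%")      # +41%
```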
6. Implementation and Hyperparameters
Training is performed on 10,000 DermCoT pairs using a 90/10 train/validation split, following the Qwen2.5-VL input format with a 4096-token context. Precision is bfloat16, with fallback to fp32 as required. Training is distributed across 8×A100 GPUs (batch size 8 per GPU, 64 total). The learning-rate schedule uses a 5% linear warm-up followed by cosine decay. The optimizer is AdamW with default weight decay, and checkpoint selection is determined by minimum validation loss.
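A minimal sketch of the optimizer and schedule described above; the trainable-adapter stand-in, total step count, and peak learning rate (whose exact value is not given here) are assumptions:

```python
import math
import torch

def lr_lambda_factory(total_steps: int, warmup_frac: float = 0.05):
    """Multiplier schedule: 5% linear warm-up, then cosine decay to zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    def lr_lambda(step: int) -> float:
        if step < warmup:
            return step / warmup                           # linear warm-up
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    return lr_lambda

adapters = torch.nn.Linear(1024, 768)  # stand-in for the trainable adapter stack
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)  # peak LR assumed
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda_factory(total_steps=1_000)
)
```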
7. Limitations and Prospective Developments
SkinGPT-R1’s training and benchmarking rely primarily on curated DermNet images; generalizability to broader demographics, imaging devices, and skin tone distributions therefore remains to be validated, and a risk of bias tied to image acquisition and label distributions persists. Future directions include data expansion using the MM-Skin, Derm1M, and MAKE corpora; integration of skin tone stratification and fairness assessments; and adaptation of the adapter-only dual distillation framework to other visually driven clinical specialties such as radiology and ophthalmology (Shen et al., 19 Nov 2025). Fine-grained lesion localization and information retrieval could further strengthen clinical grounding and practical utility.
SkinGPT-R1 demonstrates that an adapter-only dual-distillation strategy, explicitly coupling domain-aware visual feature transfer with standardized narrative supervision, enables efficient, high-quality dermatology-specific diagnostic reasoning, with the DermCoT–DermEval–DermBench toolchain establishing a reproducible standard for medical vision-language modeling research.