SkinGPT-R1: Dermatology Vision-Language Model
- The work introduces a domain-specialized, adapter-only framework that advances diagnostic chain-of-thought reasoning in clinical dermatology via dual distillation.
- It trains lightweight visual and language adapters atop a frozen multimodal backbone, efficiently combining specialized visual feature transfer with narrative supervision.
- It establishes a reproducible research foundation through standardized resources (DermCoT, DermEval, DermBench) and achieves state-of-the-art performance against strong baselines.
SkinGPT-R1 is a domain-specialized, adapter-only vision-LLM designed to advance explicit, verifiable diagnostic chain-of-thought (CoT) reasoning in clinical dermatology. It achieves state-of-the-art narrative reasoning performance through a dual distillation strategy that combines dermatology-specific visual feature transfer and high-quality chain-of-thought narrative supervision, while maintaining computational efficiency by training only lightweight adapters atop a frozen multimodal backbone. Its development introduces a standardized dermatological reasoning corpus (DermCoT), a robust six-dimensional clinician-aligned evaluation protocol (DermEval), and a large-scale benchmarking dataset (DermBench) to set a reproducible foundation for future research in vision-language reasoning within medical domains (Shen et al., 19 Nov 2025).
1. Model Architecture and Adapter Design
SkinGPT-R1 employs Vision-R1-7B as its frozen multimodal backbone, which integrates a Vision Transformer (ViT)-style visual encoder with an autoregressive language decoder. Domain adaptation and task transfer are achieved exclusively via two lightweight adapters inserted along the vision-to-language interface. The first, a visual alignment head, projects patch-level image features into a dermatology-specific teacher embedding space. The second, a low-rank language adapter, augments the language decoder by biasing vocabulary logits according to a distilled image summary projected into the decoder’s hidden space.
For an input image $x$, the ViT encoder produces patch embeddings $\{p_i\}_{i=1}^{N}$, which are aggregated into a summary vector $h$ using residual bottlenecks. The adapters compute two outputs:
- Student projection: $z^{\mathrm{stu}} = W_s h + b_s$, with $W_s \in \mathbb{R}^{d_t \times d}$ and $b_s \in \mathbb{R}^{d_t}$, where $d_t$ is the teacher embedding dimension.
- Image summary for language bias: $u = W_u h + b_u$, with $W_u \in \mathbb{R}^{d_h \times d}$ and $b_u \in \mathbb{R}^{d_h}$, where $d_h$ is the decoder hidden size. This summary is transformed via the frozen language head $W_{\mathrm{LM}}$ into a vocabulary bias $b_v = W_{\mathrm{LM}} u$ and added to the decoder logits with a scaling gate $\gamma$ on supervised positions.
All backbone parameters remain frozen; only the adapter weights are optimized. At inference, the adapter operations reduce to a static projection and a matrix addition, preserving the original backbone’s latency and memory footprint.
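A minimal PyTorch sketch of the two adapters under toy dimensions; the module names, the low-rank factorization and its rank, and the initial gate value are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VisualAlignmentHead(nn.Module):
    """Projects the pooled ViT summary h into the teacher embedding space."""
    def __init__(self, d_model: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_teacher)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)  # z_stu, matched to the teacher embedding via MSE

class LanguageBiasAdapter(nn.Module):
    """Low-rank adapter mapping the image summary to a vocabulary-logit bias."""
    def __init__(self, d_model: int, d_hidden: int, lm_head: nn.Linear, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # low-rank factorization
        self.up = nn.Linear(rank, d_hidden, bias=False)
        self.lm_head = lm_head                            # frozen backbone LM head
        for p in self.lm_head.parameters():
            p.requires_grad_(False)
        self.gamma = nn.Parameter(torch.tensor(0.1))      # scaling gate (assumed init)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        u = self.up(self.down(h))            # image summary in decoder hidden space
        return self.gamma * self.lm_head(u)  # bias b_v, added to decoder logits

# Usage with toy dimensions (the real backbone's sizes differ):
h = torch.randn(2, 256)                          # pooled ViT summaries
lm_head = nn.Linear(512, 1000, bias=False)       # stand-in for the frozen LM head
z_stu = VisualAlignmentHead(256, 128)(h)         # (2, 128)
b_v = LanguageBiasAdapter(256, 512, lm_head)(h)  # (2, 1000)
```

At inference, each module collapses to a fixed matrix product, consistent with the latency claim above.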
2. Dual Distillation and Training Objective
SkinGPT-R1’s training comprises two complementary distillation processes:
- Visual Distillation: A fixed, dermatologist-trained encoder provides dermatology-aware target embeddings $z^{\mathrm{tea}}$. The adapter output $z^{\mathrm{stu}}$ is trained to minimize the mean squared error (MSE) to the teacher embedding: $\mathcal{L}_{\mathrm{vis}} = \lVert z^{\mathrm{stu}} - z^{\mathrm{tea}} \rVert_2^2$.
- Chain-of-Thought Distillation: Supervised fine-tuning is performed on DermCoT narrative data using cross-entropy over next-token prediction, with the network logits modified by the vocabulary bias vector $b_v$. For supervised positions $t$, the logit update is $\ell_t \leftarrow \ell_t + \gamma\, m_t\, b_v$, with $m_t$ an indicator for the relevant tokens.
A weighted sum combines the two losses, with weight schedules $\lambda_{\mathrm{vis}}(\tau)$ and $\lambda_{\mathrm{CoT}}(\tau)$ controlled by a cosine ramp-up over normalized training progress $\tau \in [0, 1]$:

$$\mathcal{L} = \lambda_{\mathrm{CoT}}(\tau)\,\mathcal{L}_{\mathrm{CoT}} + \lambda_{\mathrm{vis}}(\tau)\,\mathcal{L}_{\mathrm{vis}}$$

This regime facilitates concurrent transfer of dermatology-specialized visual features and narrative diagnostic reasoning.
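A minimal sketch of the combined objective; the ramp endpoints and the choice to hold the CoT weight constant while the visual weight ramps up are assumptions about the schedule's shape:

```python
import math
import torch
import torch.nn.functional as F

def cosine_ramp(tau: float, start: float, end: float) -> float:
    """Cosine interpolation from start to end over normalized progress tau in [0, 1]."""
    return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * tau))

def dual_distillation_loss(z_stu, z_tea, logits, targets, tau: float):
    """Weighted sum of the visual-distillation MSE and the CoT cross-entropy."""
    l_vis = F.mse_loss(z_stu, z_tea)      # match dermatology teacher embeddings
    l_cot = F.cross_entropy(              # next-token prediction on DermCoT text
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,                # skip unsupervised positions
    )
    lam_vis = cosine_ramp(tau, 0.0, 1.0)  # assumed: visual term ramps up
    lam_cot = 1.0                         # assumed: CoT term held at full weight
    return lam_cot * l_cot + lam_vis * l_vis
```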
3. DermCoT Corpus and Standardized Supervisory Narratives
DermCoT is constructed from DermNet images annotated by domain experts. Generation of standardized chain-of-thought narratives follows a three-layer pipeline:
- Observation-Only Caption: Generated by a pretrained general vision-LLM (VLM) with no diagnostic content.
- Label-Aware, Diagnosis-Concluding Draft: Incorporating expert labels and diagnostic reasoning.
- Canonical Formatting and Normalization: Structured as:
- Layer 1: Visual findings (site, morphology, color, distribution)
- Layer 2: Evidence-based, differential reasoning explicitly supporting clinical separation
- Layer 3: Conclusive diagnosis with succinct clinical plan
The final corpus comprises 10,000 training narratives filtered by DermEval (mean score ≥ 4), balanced across diagnoses and anatomical sites, with a further 3,000 dermatologist-certified cases locked for benchmarking. This hierarchy ensures both clinical realism and standardization of step-wise logical reasoning.
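A hypothetical container mirroring this three-layer narrative structure; the field names, rendering format, and example text are illustrative, not the corpus's actual schema or content:

```python
from dataclasses import dataclass

@dataclass
class DermCoTNarrative:
    visual_findings: str          # Layer 1: site, morphology, color, distribution
    differential_reasoning: str   # Layer 2: evidence-based clinical separation
    conclusion: str               # Layer 3: diagnosis plus succinct plan

    def render(self) -> str:
        """Serialize the three layers in canonical order."""
        return "\n\n".join([
            f"Findings: {self.visual_findings}",
            f"Reasoning: {self.differential_reasoning}",
            f"Conclusion: {self.conclusion}",
        ])

# Illustrative (invented) example, showing the step-wise structure only:
example = DermCoTNarrative(
    visual_findings="Erythematous, sharply demarcated scaly plaques on extensor elbows.",
    differential_reasoning=(
        "Sharp demarcation and silvery scale favor psoriasis over "
        "nummular eczema, which is typically more exudative."
    ),
    conclusion="Plaque psoriasis; recommend topical corticosteroid and follow-up.",
)
print(example.render())
```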
4. Evaluation Protocols: DermEval and DermBench
Evaluation employs two components:
- DermEval: A fine-tuned LLaVA-based automated assessor outputs six-field structure-aligned scores and free-text critique for each image–narrative pair. It is trained using cross-entropy and REINFORCE on 15,000 dermatologist-annotated pairs to match physician ratings and is applied offline to curate high-quality DermCoT training cases.
- DermBench: Serves as the held-out test set of 3,000 dermatologically certified cases. For each image, 14 VLMs’ narratives are assessed under a fixed evaluator prompt, yielding scores across six dimensions: accuracy, safety, medical groundedness, clinical coverage, reasoning coherence, and description precision.
Scoring is computed as

$$\bar{s}_d = \frac{1}{N} \sum_{i=1}^{N} s_{i,d},$$

where $s_{i,d}$ denotes the dimension-$d$ score for case $i$ and $N$ is the number of cases.
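A minimal Python sketch of this per-dimension averaging, assuming scores arrive as a `{case_id: {dimension: score}}` mapping (an assumed layout, not DermEval's actual output schema):

```python
from collections import defaultdict

DIMENSIONS = ["accuracy", "safety", "groundedness",
              "coverage", "coherence", "description_precision"]

def dimension_means(scores: dict) -> dict:
    """Average each dimension over all cases; assumes every case has all six fields."""
    totals = defaultdict(float)
    for per_case in scores.values():
        for dim in DIMENSIONS:
            totals[dim] += per_case[dim]
    n = len(scores)
    means = {dim: totals[dim] / n for dim in DIMENSIONS}
    means["overall"] = sum(means[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return means

# e.g. dimension_means({"case_001": {"accuracy": 3.5, "safety": 4.2, ...}, ...})
```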
5. Experimental Results and Ablations
On DermBench, SkinGPT-R1 achieves a mean score of 4.031/5, ranking 1st among all evaluated models, a 41% relative improvement over Vision-R1 (2.865/5). Scores by dimension are: accuracy (3.476), safety (4.187), groundedness (3.459), coverage (4.403), coherence (4.026), description precision (4.637).
Zero-shot classification results demonstrate consistent performance gains over the Vision-R1 baseline:
| Model | DermBench Avg | Derm7pt (14 cls) | PAD-UFES-20 (7 cls) | ISIC (39 cls) |
|---|---|---|---|---|
| Vision-R1 | 2.865 | 27.3% | 31.7% | 7.0% |
| SkinGPT-R1 w/o visual distill | 3.404 | 29.6% | 33.4% | 7.3% |
| SkinGPT-R1 (full) | 4.031 | 32.9% | 37.6% | 8.6% |
The ablation studies reveal that DermCoT-based CoT supervision accounts for a +19% relative gain over the baseline on DermBench, and dermatology-aware visual distillation adds a further +18%, with consistent improvements across all classification tasks.
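These relative gains follow directly from the DermBench averages in the table above; a quick arithmetic check:

```python
# Relative improvements implied by the DermBench averages in the table above.
baseline, cot_only, full = 2.865, 3.404, 4.031
print(f"CoT supervision vs. baseline: +{(cot_only / baseline - 1) * 100:.0f}%")  # +19%
print(f"Visual distillation vs. CoT:  +{(full / cot_only - 1) * 100:.0f}%")      # +18%
print(f"Full model vs. baseline:      +{(full / baseline - 1) * 100:.0f}%")      # +41%
```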
6. Implementation and Hyperparameters
Training is performed on 10,000 DermCoT pairs using a 90/10 train/validation split, following the Qwen2.5-VL input format with a 4096-token context. Precision is bfloat16, with fallback to fp32 as required. Training is distributed across 8×A100 GPUs (batch size 8 per GPU, 64 total). The learning-rate schedule uses a 5% linear warm-up followed by cosine decay. The optimizer is AdamW with default weight decay, and checkpoint selection is determined by minimum validation loss.
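A minimal sketch of the optimizer and schedule described above; the trainable-adapter stand-in, total step count, and peak learning rate (whose exact value is not given here) are assumptions:

```python
import math
import torch

def lr_lambda_factory(total_steps: int, warmup_frac: float = 0.05):
    """Multiplier schedule: 5% linear warm-up, then cosine decay to zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    def lr_lambda(step: int) -> float:
        if step < warmup:
            return step / warmup                           # linear warm-up
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    return lr_lambda

adapters = torch.nn.Linear(1024, 768)  # stand-in for the trainable adapter stack
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)  # peak LR assumed
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda_factory(total_steps=1_000)
)
```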
7. Limitations and Prospective Developments
SkinGPT-R1’s training and benchmarking rely primarily on curated DermNet images; generalizability to broader demographics, imaging devices, and skin tone distributions therefore remains to be validated, and a risk of bias tied to image acquisition and label distributions persists. Future directions include data expansion using the MM-Skin, Derm1M, and MAKE corpora; integration of skin tone stratification and fairness assessments; and adaptation of the adapter-only dual distillation framework to other visually driven clinical specialties such as radiology and ophthalmology (Shen et al., 19 Nov 2025). Fine-grained lesion localization and information retrieval could further strengthen clinical grounding and practical utility.
SkinGPT-R1 demonstrates that an adapter-only dual-distillation strategy, explicitly coupling domain-aware visual feature transfer with standardized narrative supervision, enables efficient, high-quality dermatology-specific diagnostic reasoning, with the DermCoT–DermEval–DermBench toolchain establishing a reproducible standard for medical vision-language modeling research.