RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models

Published 6 Mar 2025 in cs.CV, cs.AI, cs.CL, and cs.LG | arXiv:2503.03987v1

Abstract: Recently, Multimodal LLMs (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to enhance retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms generic-domain MLLMs by a large margin in the diagnosis of retinal diseases on 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT

Summary

  • The paper introduces RetinalGPT, a novel multimodal conversational assistant powered by large vision-language models for quantitative analysis of retinal images.
  • RetinalGPT utilizes a large, custom retinal dataset and a two-stage instruction tuning method to gain expertise in retinal analysis while retaining general medical knowledge.
  • Evaluations show RetinalGPT achieves superior performance on multiple ophthalmic benchmarks for disease diagnosis, lesion localization, and vascular analysis, demonstrating potential for broader medical applicability.

The paper "RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models" introduces RetinalGPT, a novel multimodal conversational assistant designed for the quantitative analysis of retinal images using Multimodal LLMs (MLLMs). The study addresses the limitations of general-domain MLLMs in performing specialized tasks such as the interpretation of retinal images, which is crucial for diagnosing ocular diseases. The authors highlight the gap between general-domain and medical-domain MLLMs and propose RetinalGPT to bridge it by enhancing retinal disease diagnosis capabilities.

Key Contributions:

  1. Retinal-Specific Dataset and Pipeline:
    • The paper details the creation of a large, diverse dataset of approximately 38,000 retinal images. This dataset is enriched with disease labels, lesion bounding boxes, and vascular features. The data pipeline includes clinical data extraction using tools like AutoMorph for fractal analysis of retinal vascular structures, assigning clinically meaningful features to each image.
  2. Instruction Tuning and Training:
    • RetinalGPT employs a customized visual instruction tuning method to enhance its retinal analysis capabilities. By employing a two-stage training strategy, the model aligns generic-domain VLMs to be effective in retinal domain tasks while preserving broader biomedical knowledge:
      • Stage 1 (Feature Alignment): A mixup of retinal-specific and general biomedical datasets is used to align visual features with the language model while maintaining generic medical-domain knowledge.
      • Stage 2 (Mixup Instruction-Tuning): The model is fine-tuned on retinal-specific instruction data mixed with generic medical data, so that retinal-specific capabilities are gained without losing general medical understanding.
  3. Performance and Evaluation:
    • RetinalGPT is evaluated against several state-of-the-art models on eight benchmark datasets covering multiple ophthalmic diseases. It demonstrates superior performance, particularly in disease diagnosis, lesion localization, and vascular structure analysis.
    • Results showcase RetinalGPT's successful lesion localization capability, predicting lesion bounding boxes with high accuracy compared to ground truth annotations. It also accurately estimates vascular feature values, validating the precision of its analysis.
  4. Generalization to Generic Medical Domain:
    • When tested on generic medical questions, RetinalGPT produces responses similar to those of LLaVA-Med, indicating that it preserves knowledge beyond the retinal domain and suggesting applicability to broader medical imaging contexts.
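The vascular fractal analysis mentioned in the data pipeline (via tools like AutoMorph) is commonly implemented by box counting on a binary vessel segmentation mask. The sketch below is a minimal, generic box-counting estimator, not AutoMorph's actual implementation; the function name and parameters are illustrative:

```python
import numpy as np

def box_counting_dimension(mask: np.ndarray, min_box: int = 2) -> float:
    """Estimate the fractal dimension of a 2-D binary mask via box counting."""
    assert mask.ndim == 2
    # Box sizes: powers of two, from min_box up to half the shorter image side.
    sizes = []
    s = min_box
    while s <= min(mask.shape) // 2:
        sizes.append(s)
        s *= 2
    counts = []
    for s in sizes:
        # Crop to a multiple of s, then count boxes containing any vessel pixel.
        h = mask.shape[0] // s * s
        w = mask.shape[1] // s * s
        blocks = mask[:h, :w].reshape(h // s, s, w // s, s)
        occupied = int(blocks.any(axis=(1, 3)).sum())
        counts.append(max(occupied, 1))
    # Fractal dimension = slope of log(count) against log(1 / box size).
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return float(slope)
```

As sanity checks, a fully filled mask yields a dimension near 2, and a single straight vessel-like line yields a dimension near 1; healthy retinal vasculature typically falls between these extremes.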
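The summary does not state which metric is used to compare predicted lesion bounding boxes against ground-truth annotations; a standard choice for this kind of localization evaluation is intersection-over-union (IoU). A minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle; width/height clamp to zero when boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A predicted lesion box is then typically counted as correct when its IoU with the annotated box exceeds a threshold such as 0.5; the threshold here is a common convention, not one confirmed by the summary.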

Conclusion:

RetinalGPT marks a significant advancement in retinal image analysis, leveraging large-scale multimodal models to improve the quantitative and interpretative dimensions of clinical diagnostics. It stands out for integrating extensive biomedical domain knowledge with focused retinal expertise to support detailed, interpretable, end-to-end clinical frameworks. The authors note the model's limitation regarding modality-centric initial responses and plan to address this in future work to enhance conversational dynamics.
