PubMedVision: Medical Vision–Language Dataset

Updated 6 February 2026
  • PubMedVision is a high-quality, large-scale medical vision–language dataset featuring 1.3 million VQA pairs refined from PubMed full-text articles.
  • It employs a rigorous multimodal pipeline leveraging GPT-4V for unblinded reformatting, advanced filtering, and deduplication to ensure precise image–text alignment.
  • Empirical evaluations show that models like HuatuoGPT-Vision achieve significant accuracy improvements on various medical VQA benchmarks using this dataset.

PubMedVision is a large-scale, high-quality medical vision–language dataset derived from refined and denoised image–text pairs sourced from PubMed. Created using a rigorous multimodal pipeline that leverages GPT-4V in an “unblinded” capacity, PubMedVision contains 1.3 million medical visual question answering (VQA) samples designed to enhance the performance of multimodal LLMs (MLLMs) in medical contexts. The dataset has demonstrated significant improvements on established medical VQA benchmarks and constitutes the foundational resource for training the HuatuoGPT-Vision series of medical MLLMs (Chen et al., 2024).

1. Data Sources and Preprocessing

PubMedVision aggregates medical image–text pairs by pooling three large-scale, publicly available crawls of PubMed full-text papers: LLaVA-Med PMC (approximately 514,000 images), PMC-Inline (approximately 11 million), and PMC-OA (approximately 1 million). The initial dataset includes figures alongside their associated text—captions and inline mentions—from PubMed articles.

Text-based filtering leverages a medical lexicon derived from the UMLS SPECIALIST Lexicon, pruned by GPT-4 to focus on medically dense passages. Only image–text pairs where the contextual text includes at least five unique medical terms are retained. Image filtering employs a CLIP-based classifier, trained using 1,000 manually labeled and 10,000 GPT-4V–labeled samples, to exclude low-resolution images (less than 336×336 pixels) and non-medical figures such as charts and graphs. The classifier achieves 91% accuracy on a held-out test set.
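The term-count criterion above can be sketched as a simple set-intersection check. The tiny lexicon, helper name, and example caption below are illustrative stand-ins; the actual pipeline uses the GPT-4-pruned UMLS SPECIALIST Lexicon.

```python
# Minimal sketch of the text-based medical filter: keep an image-text
# pair only if its contextual text contains at least five unique medical
# terms. MEDICAL_LEXICON is a toy stand-in for the pruned UMLS lexicon.
import re

MEDICAL_LEXICON = {
    "mri", "ct", "lesion", "sinus", "gadolinium", "enhancement",
    "mass", "sphenoid", "radiograph", "carcinoma",
}

MIN_UNIQUE_TERMS = 5  # threshold reported for PubMedVision

def passes_text_filter(text: str) -> bool:
    """Return True if the text mentions >= 5 distinct lexicon terms."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & MEDICAL_LEXICON) >= MIN_UNIQUE_TERMS

caption = ("Axial MRI shows a mass in the left sphenoid sinus with "
           "homogenous gadolinium enhancement; no lesion of bone.")
print(passes_text_filter(caption))  # True: 7 distinct lexicon terms match
```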

Deduplication is achieved by encoding remaining captions with Sentence-BERT (all-mpnet-base-v2) and pruning pairs with dot-product similarity above 0.48. The result is a curated set of 914,960 unique medical image–text pairs spanning diverse body parts (head & neck, chest, abdomen, limbs, etc.) and imaging modalities (CT, MRI, X-ray, ultrasound, fundus photography, OCT, dermoscopy, microscopy, endoscopy) (Chen et al., 2024).
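The deduplication step can be sketched as a greedy scan over caption embeddings. In the real pipeline each caption is encoded with Sentence-BERT (all-mpnet-base-v2); the toy 2-D vectors and function names below are illustrative only.

```python
# Sketch of caption deduplication: prune a caption when its dot-product
# similarity with an already-kept caption exceeds the 0.48 threshold
# reported for PubMedVision. Toy vectors stand in for SBERT embeddings.

SIM_THRESHOLD = 0.48

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def deduplicate(embeddings: list[list[float]]) -> list[int]:
    """Greedy dedup: return indices of the captions to keep."""
    kept: list[int] = []
    for i, vec in enumerate(embeddings):
        if all(dot(vec, embeddings[j]) <= SIM_THRESHOLD for j in kept):
            kept.append(i)
    return kept

# Unit-length toy embeddings: the first two are near-duplicates.
emb = [
    [1.0, 0.0],
    [0.98, 0.199],  # similarity 0.98 with the first vector -> pruned
    [0.0, 1.0],     # orthogonal to everything kept so far -> kept
]
print(deduplicate(emb))  # [0, 2]
```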

2. Dataset Cleaning, Denoising, and Reformatting

PubMedVision applies an “unblinded” reformatting workflow using GPT-4V, in which both the image(s) I and the associated context X are provided jointly to the model. GPT-4V interprets each pair to generate a triplet (d, q, a), where d is a descriptive summary, q is a question (for instruction tuning), and a is the corresponding answer.

Two types of VQA samples are generated:

  • Alignment VQA: The question q′ is sampled from a small set of human-written prompts (e.g., “Please describe this picture”), and GPT-4V generates an answer a given description d.
  • Instruction-Tuning VQA: GPT-4V autonomously generates both the question q and the answer a under one of ten pre-defined scenario prompts (including standard QA, peer-to-peer doctor interactions, intern–specialist discussions, and AI-model-to-doctor/patient scenarios), maximizing diversity in question style and instruction adherence.
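The two sample types above can be sketched as follows. The prompt list, field names, and helper functions are illustrative assumptions, not the released data format.

```python
import random

# Illustrative stand-ins for the small set of human-written alignment
# prompts from which q' is sampled.
ALIGNMENT_PROMPTS = [
    "Please describe this picture.",
    "What does this image show?",
]

def make_alignment_vqa(description: str, seed: int = 0) -> dict:
    """Alignment VQA: sampled human prompt q', GPT-4V description as the answer."""
    rng = random.Random(seed)
    return {"type": "alignment",
            "question": rng.choice(ALIGNMENT_PROMPTS),
            "answer": description}

def make_instruction_vqa(question: str, answer: str, scenario: str) -> dict:
    """Instruction-tuning VQA: GPT-4V writes both q and a under one of
    ten scenario prompts (standard QA, doctor-to-doctor, etc.)."""
    return {"type": "instruction", "scenario": scenario,
            "question": question, "answer": answer}
```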

Noisy data are filtered using criteria that GPT-4V enforces with visual context: samples with ambiguous or irrelevant captions, factual inaccuracies, insufficient token counts, or generic disclaimers (“consult your doctor”) are excluded. Denoising is further enforced through output-length constraints and prompt engineering that instructs the model to refuse hallucinated statements (Chen et al., 2024).

3. Dataset Structure and Composition

PubMedVision comprises a total of 1,294,062 VQA pairs, equally split between the two types of VQA samples:

  • 647,031 Alignment-VQA
  • 647,031 Instruction-Tuning-VQA

Questions reflect ten distinct clinical, didactic, or AI-assistance scenarios, ranging from simple image description to complex clinician-to-clinician queries and model evaluation tasks. Answers are open-ended and purely textual, typically consisting of concise medical summaries or multi-sentence analyses. For example, under the standard QA scenario, a typical item is:

  • Question: “What is the location of the mass observed in the MRI image?”
  • Answer: “The mass is located within the left sphenoid sinus, occupying a significant portion of that sinus with homogenous gadolinium enhancement and no signs of bony invasion.”

No fixed train/validation/test splits are provided. Instead, data is partitioned into alignment and instruction sets, which correspond directly to the pretraining and finetuning stages for model development.
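The stage partitioning described above can be sketched as a simple filter by sample type; the `"type"` field is an illustrative stand-in for the released metadata.

```python
# Sketch of the stage partition: Alignment-VQA items feed the
# pretraining stage, Instruction-Tuning-VQA items the finetuning stage.

def split_for_training(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition PubMedVision items into (pretraining, finetuning) sets."""
    pretrain = [s for s in samples if s["type"] == "alignment"]
    finetune = [s for s in samples if s["type"] == "instruction"]
    return pretrain, finetune

toy = [{"type": "alignment", "question": "Please describe this picture."},
       {"type": "instruction", "question": "Where is the mass located?"}]
pretrain, finetune = split_for_training(toy)
print(len(pretrain), len(finetune))  # 1 1
```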

| VQA Type | Number of Samples | Generation Strategy |
| --- | --- | --- |
| Alignment VQA | 647,031 | Human-prompted, GPT-4V answers |
| Instruction-Tuning VQA | 647,031 | GPT-4V-generated Q&A, 10 scenarios |

4. Empirical Evaluation and Expert Validation

PubMedVision’s impact on medical multimodal learning is validated by systematic benchmarking and expert review. Four medical VQA benchmarks—VQA-RAD, SLAKE (closed English), PathVQA, and PMC-VQA—are used for zero-shot evaluation. When fine-tuned on PubMedVision, LLaVA-1.5-LLaMA-3-8B demonstrates marked improvements:

  • Baseline: 51.0% accuracy
  • +LLaVA_Med: 55.6% (+4.6)
  • +PubMedVision: 62.7% (+11.7)
  • HuatuoGPT-Vision-34B: 66.7%

Further, on OmniMedVQA (42 imaging tasks covering diverse modalities), PubMedVision yields 75.1% accuracy (baseline: 48.8%, +LLaVA_Med: 65.5%, HuatuoGPT-Vision: 76.7%). On the MMMU Health & Medicine track, PubMedVision supports an increase from 38.2% (baseline) and 41.1% (+LLaVA_Med) to 49.1%, with HuatuoGPT-Vision reaching 54.4%.
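The gains quoted above are simple differences from the baseline; a quick arithmetic check using only the figures reported in the text:

```python
# Reported zero-shot accuracies (%) for LLaVA-1.5-LLaMA-3-8B, averaged
# over the four medical VQA benchmarks, as quoted above.
baseline = 51.0
with_llava_med = 55.6
with_pubmedvision = 62.7

print(round(with_llava_med - baseline, 1))     # 4.6
print(round(with_pubmedvision - baseline, 1))  # 11.7
```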

Manual reviews by three medical experts on 360 randomly chosen samples (scored 1–5 for accuracy, relevance, completeness, practicality) attribute the highest mean ratings (≈4.5–4.8) to PubMedVision’s “unblinded” GPT-4V outputs, compared to native captions (≈3.0–3.8) and text-only LLM reformats (≈3.5–4.2). This suggests notable improvement in clinical utility and factual grounding. No formal inter-annotator agreement metric (e.g., Fleiss’ κ) is reported, but expert consensus is high (standard deviation <0.4 across raters) (Chen et al., 2024).

5. Integration into Downstream MLLMs: HuatuoGPT-Vision

PubMedVision is the primary pretraining resource for HuatuoGPT-Vision, an open-source medical MLLM. The base architecture uses LLaVA-1.5 with a CLIP-Large encoder (336×336) and a two-layer MLP projector, with LLaMA-3-8B or Yi-1.5-34B as language backbones. Training comprises two stages with identical hyperparameters to LLaVA-1.5: pretraining on 558,000 LLaVA-1.5 examples and 647,000 PubMedVision Alignment-VQA items, followed by finetuning with 658,000 LLaVA-1.5 instances and 647,000 PubMedVision Instruction-Tuning-VQA samples.

Enhancements include expansion to a Yi-1.5-34B backbone, 348,000 PubMedVision pairs translated to Chinese, and text-only medical pretraining from the HuatuoGPT-II corpora. Although the exact GPU count is undisclosed, training matches the 8B-parameter LLaVA regime and plausibly utilizes 8–16 A100 GPUs per stage over approximately 24 hours. At 34B parameters, HuatuoGPT-Vision attains state-of-the-art open-source performance across all evaluated medical multimodal datasets, gains attributed in large part to PubMedVision’s scale and quality (Chen et al., 2024).

6. Significance and Implications

PubMedVision represents a foundational advance for medical multimodal research, providing the first large-scale, high-quality, systematically denoised, and richly annotated VQA dataset constructed through “unblinded” multimodal LLM reformatting. Empirical validation confirms that MLLMs trained on PubMedVision achieve major boosts in accuracy and clinical relevance across a variety of imaging modalities and clinical scenarios. The dataset’s design, encompassing multilingual capacity and scenario diversity, establishes a paradigm for medical vision–language alignment and instruction tuning, directly addressing longstanding bottlenecks related to data scale, quality, and annotation scarcity in medical AI research (Chen et al., 2024).
