OmniMedVQA Benchmark for Medical VQA
- OmniMedVQA is a comprehensive benchmark featuring over 118,000 images paired with diverse clinical QA items across 12 imaging modalities.
- It employs rigorous dataset construction with LLM-enhanced, template-based question generation and expert human verification to ensure clinical validity.
- Benchmark evaluations reveal model vulnerabilities under adversarial settings, emphasizing the need for robust and reasoning-aware medical AI systems.
OmniMedVQA is a large-scale, multi-modality benchmark specifically curated for the rigorous evaluation of medical Large Vision-Language Models (LVLMs) and Vision-Language Models (VLMs) on medical Visual Question Answering (VQA). It is designed to stress-test reasoning, understanding, and robustness across a diverse spectrum of real-world medical imaging modalities and clinical query types, supporting both classification-style and open-ended VQA evaluation. Its central role in the medical VQA landscape is underscored by its adoption in benchmarking and robustness research, including recent works on adversarial defenses, resource-efficient architectures, and comparative competency studies (Pramana et al., 22 Dec 2025, Alsinglawi et al., 8 Apr 2025, Liu et al., 15 Jul 2025, Hu et al., 2024).
1. Benchmark Construction and Dataset Composition
OmniMedVQA is constructed from a composite of 73 medical datasets, including both fully open-access and restricted-access sources, and encompasses 12 imaging modalities: Computed Tomography (CT), Magnetic Resonance Imaging (MR/MRI), X-Ray, Ultrasound (US), Dermoscopy (Der), Endoscopy (End), Fundus Photography (FP), Microscopy (Mic), Optical Coherence Tomography (OCT), Digital Photography (DP), Infrared Reflectance Imaging (IRI), and Colposcopy (Co) (Hu et al., 2024).
- Coverage: Over 20 anatomical regions, including brain, spine, lung, liver, kidney, eye, and musculoskeletal sites.
- QA Construction: Each image is paired with up to five question types, provided through template-based generation refined for linguistic diversity via LLMs (GPT-3.5), and finalized with a human verification step to ensure clinical validity and balanced sampling.
- Scale: The canonical release reports 118,010 images and 127,995 QA items (Hu et al., 2024). Typical experimental splits, such as those in SafeMed-R1 and LLama-CLIP works, use subsets in the 80,000–89,000 sample range spread across eight principal modalities (Pramana et al., 22 Dec 2025, Alsinglawi et al., 8 Apr 2025).
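The template-based QA construction step can be illustrated with a minimal sketch. The template strings, labels, and helper name below are hypothetical stand-ins; the actual benchmark refines its templates with GPT-3.5 and adds human verification:

```python
import random

# Hypothetical question templates with format slots; the real benchmark
# uses LLM-refined templates plus expert human verification.
TEMPLATES = {
    "diagnosis": "What abnormality is shown in this {modality} image?",
    "modality": "Which imaging modality was used to acquire this image?",
    "presence": "Is {label} present in this image?",
}

def make_qa(modality: str, label: str, distractors: list[str], seed: int = 0) -> dict:
    """Build one multiple-choice QA item: question, shuffled options, answer."""
    rng = random.Random(seed)
    question = TEMPLATES["diagnosis"].format(modality=modality)
    options = distractors + [label]
    rng.shuffle(options)  # avoid a fixed answer position
    return {"question": question, "options": options, "answer": label}

item = make_qa("MRI", "glioma", ["meningioma", "normal", "metastasis"])
print(item["answer"] in item["options"])  # → True
```

Seeding the shuffle keeps the generated items reproducible across runs, which matters when the same split is reused for training and evaluation.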
| Modality | Images (Canonical) | Typical Experiment Split (Train/Test) |
|---|---|---|
| MRI | 31,917 | 25,507 / 6,370 |
| CT | 14,457 | 12,567 / 3,241 |
| Ultrasound | 10,855 | 8,917 / 2,074 |
| Fundus Photo | 10,108 | 4,300 / 1,098 |
| Microscopy | 19,785 | 4,570 / 1,110 |
| Dermoscopy | 5,967 | 5,373 / 1,306 |
| OCT | 3,791 | 3,798 / 848 |
| X-Ray | 7,594 | 6,301 / 1,615 |
2. Taxonomy of Question Types and Answer Formats
OmniMedVQA supports a broad spectrum of clinically-relevant VQA question formulations, systematically covering:
- Diagnosis: Selection among disease or pathology labels.
- Presence/Absence: Binary queries (e.g., “Is there hemorrhage present?”).
- Anatomical Localization: Selection of region or organ from a fixed set.
- Additional categories: Modality recognition, anatomy identification, lesion grading, and other biological attributes are included in the full corpus.
Answer formats are partitioned as follows:
- Closed-set class label: For diagnosis/localization; evaluated by exact match.
- Binary yes/no: For presence queries.
- Short free-form text: For "other" or open-ended categories (less frequent).
- Multiple-choice format: Canonical zero-shot evaluations supply 3–5 distractor options per question (Hu et al., 2024, Liu et al., 15 Jul 2025).
- Open-ended adaptation: Recent works adapt the corpus to permit unconstrained answer generation, mapping outputs to choices via embedding similarity (Alsinglawi et al., 8 Apr 2025).
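The embedding-similarity mapping used in the open-ended adaptation can be sketched as follows. The bag-of-words embedding here is a deliberately simple stand-in for the sentence encoder used in the cited work, which is not specified here:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would use a
    learned text encoder (e.g. a CLIP or SBERT text tower) instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_to_choice(generated: str, options: list[str]) -> str:
    """Return the answer option whose embedding is closest to the
    model's free-form generated answer."""
    g = embed(generated)
    return max(options, key=lambda o: cosine(g, embed(o)))

options = ["pneumonia", "pleural effusion", "no finding"]
print(map_to_choice("the scan shows signs of pneumonia", options))  # → pneumonia
```

This reduction lets generative models be scored with the same exact-match accuracy as multiple-choice models, at the cost of depending on the encoder's notion of similarity.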
The per-modality and per-category label distributions mirror real-world case prevalences, with MRI dominating (≈36%), and OCT and Fundus forming ≈6% each in most experimental configurations (Pramana et al., 22 Dec 2025).
3. Evaluation Protocols and Metrics
OmniMedVQA supports several evaluation protocols compatible with classification-style, multi-choice, and generative VQA architectures:
- Accuracy (ACC): Standard for both closed and open-ended sub-tasks, requiring an exact match to the ground truth (Alsinglawi et al., 8 Apr 2025, Liu et al., 15 Jul 2025).
- Question-Answering Score (QA-Acc): For multi-choice tasks in the canonical zero-shot setting, the fraction of questions whose generated answer matches the ground-truth option: $\text{QA-Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{a}_i = a_i]$.
- Prefix-Based Score (PB-Acc): For each candidate option $o_j$ and question $q$ over image $I$, the model scores the option by its log-likelihood, $s_j = \log P_\theta(o_j \mid q, I)$. The answer with maximal log-likelihood, $\hat{a} = \arg\max_j s_j$, is selected.
Notably, evaluations built on OmniMedVQA increasingly demand robustness metrics:
- Robust Accuracy under Attack: Measures ACC on adversarially perturbed images.
- Area Under the Accuracy-Under-Attack Curve (AUA): Tracks accuracy over a range of attack strengths (Pramana et al., 22 Dec 2025).
- Certified Robustness via Randomized Smoothing: the certified radius is $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$, where $p_A$, $p_B$ are estimated probabilities for the top answer and runner-up under Gaussian-noise smoothed inference with noise level $\sigma$, and $\Phi^{-1}$ is the standard normal inverse CDF.
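A minimal computation of the smoothing certificate, assuming the standard randomized-smoothing form of the radius (the source does not spell out the exact estimator used for $p_A$ and $p_B$):

```python
from statistics import NormalDist

def certified_radius(p_a: float, p_b: float, sigma: float) -> float:
    """Certified L2 radius R = (sigma/2) * (Phi^-1(p_a) - Phi^-1(p_b)),
    where p_a and p_b bound the top-1 and runner-up answer probabilities
    under Gaussian-smoothed inference with noise level sigma."""
    phi_inv = NormalDist().inv_cdf  # standard normal inverse CDF
    return 0.5 * sigma * (phi_inv(p_a) - phi_inv(p_b))

# e.g. a confident smoothed classifier with sigma = 0.25
r = certified_radius(p_a=0.9, p_b=0.05, sigma=0.25)
print(round(r, 3))  # → 0.366
```

When $p_A = p_B$ the radius is zero: the smoothed classifier cannot certify anything at inputs where the top two answers are tied.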
4. Baseline and State-of-the-Art Model Performance
Multiple published baselines illuminate both the strengths and persistent limitations of diverse VQA methods evaluated on OmniMedVQA.
| Model (General Domain) | QA-Acc (%) | PB-Acc (%) | Closed-Ended ACC (%) |
|---|---|---|---|
| BLIP-2 | 50.69 | 33.43 | 48.12 |
| InstructBLIP | 42.49 | 28.71 | 40.4 |
| LLaMA-Adapter-v2 | 33.15 | 29.88 | N/A |
| LLaVA | 27.85 | 20.02 | N/A |
| Model (Medical Specialized) | QA-Acc (%) | PB-Acc (%) | Closed-Ended ACC (%) |
|---|---|---|---|
| MedVInT | 41.50 | 25.81 | — |
| LLaVA-Med | 28.78 | 24.06 | — |
| RadFM | 26.82 | 29.00 | 26.99 |
Resource-efficient state-of-the-art methods include BiomedCLIP+LLaMA-3-8B ("LLama-CLIP"), which reaches 73.4% ACC on open-ended questions and 76.9% on yes/no closed-ended questions using only two A100-40GB GPUs (Alsinglawi et al., 8 Apr 2025). Larger models (Huatuo-GPT-Vision-32B, Lingshu-32B) achieve up to 76.6% in zero-shot multi-choice settings (Liu et al., 15 Jul 2025).
Adversarial evaluation reveals:
- Fine-tuned Qwen3-VL-4B ("Instruct" style): 95.43% clean ACC collapses to 26.07% under strong PGD attack.
- SafeMed-R1: Combines adversarial training with randomized smoothing, reaching 84.45% robust accuracy (a roughly 59-point improvement over standard fine-tuning) (Pramana et al., 22 Dec 2025).
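The PGD evaluation underlying these robust-accuracy numbers can be illustrated on a toy differentiable model. This is a generic ℓ∞ PGD loop against a logistic classifier in pure Python, not the attack configuration or model from the cited paper:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def pgd_linf(x, w, y, eps=0.1, alpha=0.02, steps=10):
    """Untargeted L-inf PGD against a logistic model p(y=1|x) = sigmoid(w.x).
    The cross-entropy gradient w.r.t. x is (p - y) * w; each step moves
    alpha in the sign direction, then projects back into the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))
        grad = [(p - y) * wi for wi in w]
        x_adv = [xi + alpha * (1 if g > 0 else -1 if g < 0 else 0)
                 for xi, g in zip(x_adv, grad)]
        # project each coordinate onto the eps-ball around the clean input
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv

x, w, y = [1.0, -0.5, 0.3], [0.8, -1.2, 0.5], 1
x_adv = pgd_linf(x, w, y)
clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))
print(adv < clean)  # → True: the attack lowers the true-class probability
```

Robust accuracy is then simply clean accuracy recomputed on `x_adv` across the test set; sweeping `eps` produces the accuracy-under-attack curve that AUA integrates.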
5. Robustness, Reasoning, and Failure Modes
The benchmark exposes model vulnerabilities to adversarial perturbations and highlights reasoning challenges:
- Adversarial Fragility: Standard VLMs lose roughly 70 percentage points of accuracy under norm-bounded PGD perturbations, while robust training (AT-GRPO + smoothing) recovers most of the lost accuracy.
- Reasoning vs. Understanding: Analysis partitions questions into “understanding” (∼40%) and “reasoning” (∼60%). Large, domain-tuned VLMs show up to 0.75 reasoning accuracy, with some models closing the gap between understanding and reasoning more effectively than generalist models (Liu et al., 15 Jul 2025).
- Modality-Specific Gaps: High-frequency texture modalities (e.g., Fundus, OCT) are especially sensitive to adversarial noise in non-robust models. MRI remains hardest for open-ended systems, with the lowest accuracy (e.g., 69.2% for LLama-CLIP) (Alsinglawi et al., 8 Apr 2025, Pramana et al., 22 Dec 2025).
- Interpretability and Robustness: Chain-of-thought reasoning traces improve adversarial recoverability, as demonstrated by SafeMed-R1’s “Think” model outperforming “Instruct” in perturbed regimes (Pramana et al., 22 Dec 2025).
6. Benchmark Insights, Limitations, and Recommendations
Key conclusions and open challenges highlighted by published analyses:
- General-Purpose Superiority: Large general-domain VLMs with strong pre-training (e.g., BLIP-2) regularly outperform current medical-specialized systems, especially for modality or anatomy queries akin to natural images (Hu et al., 2024).
- Alignment Gaps: Medical LVLMs struggle with rare modalities and subtle pathological cues due to insufficient domain-specific pretraining and limited instruction tuning.
- Overfitting and Dataset Uniformity: Repetitive QA pairs and templates can lead to memorization in smaller models, undermining real clinical robustness (Alsinglawi et al., 8 Apr 2025).
- Clinical Reliability Threshold: Even top-performing systems err on ∼23% of cases; this precludes clinical deployment and motivates continued research in multimodal alignment, reasoning-aware training, and more diverse supervision (Liu et al., 15 Jul 2025).
- Recommendations: Scale high-quality medical image–text corpora; integrate multi-task alignment objectives (segmentation, detection, reporting); conduct more detailed ablations and modality-specific tuning (Hu et al., 2024).
7. Significance and Future Directions
OmniMedVQA has established itself as a reference benchmark for medical LVLM/VQA progress, supporting:
- Unified, multi-modal evaluation across radiology, histopathology, ophthalmology, dermatology, and other subfields.
- Systematic adversarial robustness assessment, now a critical requirement for medical AI trustworthiness.
- Direct study of medical reasoning, interpretability, and modality-specific challenges.
- Comparative benchmarking for both resource-intensive and lightweight clinical AI architectures.
Future work is motivated by pipeline enhancements—richer medical image–text corpora for pretraining, explicit incorporation of multi-step reasoning training, open-ended narrative generation, and certified defenses against adversarial manipulations to close the safety gap for real-world deployments (Pramana et al., 22 Dec 2025, Hu et al., 2024).