OmniMedVQA Benchmark for Medical VQA
- OmniMedVQA is a comprehensive benchmark featuring over 118,000 images paired with diverse clinical QA items across 12 imaging modalities.
- It employs rigorous dataset construction with LLM-enhanced, template-based question generation and expert human verification to ensure clinical validity.
- Benchmark evaluations reveal model vulnerabilities under adversarial settings, emphasizing the need for robust and reasoning-aware medical AI systems.
OmniMedVQA is a large-scale, multi-modality benchmark specifically curated for the rigorous evaluation of medical Large Vision-Language Models (LVLMs) and Vision-Language Models (VLMs) on medical Visual Question Answering (VQA). It is designed to stress-test reasoning, understanding, and robustness across a diverse spectrum of real-world medical imaging modalities and clinical query types, supporting both classification-style and open-ended VQA evaluation. Its central role in the medical VQA landscape is underscored by its adoption in benchmarking and robustness research, including recent works on adversarial defenses, resource-efficient architectures, and comparative competency studies (Pramana et al., 22 Dec 2025, Alsinglawi et al., 8 Apr 2025, Liu et al., 15 Jul 2025, Hu et al., 2024).
1. Benchmark Construction and Dataset Composition
OmniMedVQA is constructed from a composite of 73 medical datasets, including both fully open-access and restricted-access sources, and encompasses 12 imaging modalities: Computed Tomography (CT), Magnetic Resonance Imaging (MR/MRI), X-Ray, Ultrasound (US), Dermoscopy (Der), Endoscopy (End), Fundus Photography (FP), Microscopy (Mic), Optical Coherence Tomography (OCT), Digital Photography (DP), Infrared Reflectance Imaging (IRI), and Colposcopy (Co) (Hu et al., 2024).
- Coverage: Over 20 anatomical regions, including brain, spine, lung, liver, kidney, eye, and musculoskeletal sites.
- QA Construction: Each image is paired with up to five question types, provided through template-based generation refined for linguistic diversity via LLMs (GPT-3.5), and finalized with a human verification step to ensure clinical validity and balanced sampling.
- Scale: The canonical release reports 118,010 images and 127,995 QA items (Hu et al., 2024). Typical experimental splits, such as those in SafeMed-R1 and LLama-CLIP works, use subsets in the 80,000–89,000 sample range spread across eight principal modalities (Pramana et al., 22 Dec 2025, Alsinglawi et al., 8 Apr 2025).
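The template-based QA construction step can be illustrated with a minimal sketch. The template strings, labels, and helper name below are hypothetical stand-ins; the actual benchmark refines its templates with GPT-3.5 and adds human verification:

```python
import random

# Hypothetical question templates with format slots; the real benchmark
# uses LLM-refined templates plus expert human verification.
TEMPLATES = {
    "diagnosis": "What abnormality is shown in this {modality} image?",
    "modality": "Which imaging modality was used to acquire this image?",
    "presence": "Is {label} present in this image?",
}

def make_qa(modality: str, label: str, distractors: list[str], seed: int = 0) -> dict:
    """Build one multiple-choice QA item: question, shuffled options, answer."""
    rng = random.Random(seed)
    question = TEMPLATES["diagnosis"].format(modality=modality)
    options = distractors + [label]
    rng.shuffle(options)  # avoid a fixed answer position
    return {"question": question, "options": options, "answer": label}

item = make_qa("MRI", "glioma", ["meningioma", "normal", "metastasis"])
print(item["answer"] in item["options"])  # → True
```

Seeding the shuffle keeps the generated items reproducible across runs, which matters when the same split is reused for training and evaluation.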
| Modality | Images (Canonical) | Typical Experiment Split (Train/Test) |
|---|---|---|
| MRI | 31,917 | 25,507 / 6,370 |
| CT | 14,457 | 12,567 / 3,241 |
| Ultrasound | 10,855 | 8,917 / 2,074 |
| Fundus Photo | 10,108 | 4,300 / 1,098 |
| Microscopy | 19,785 | 4,570 / 1,110 |
| Dermoscopy | 5,967 | 5,373 / 1,306 |
| OCT | 3,791 | 3,798 / 848 |
| X-Ray | 7,594 | 6,301 / 1,615 |
2. Taxonomy of Question Types and Answer Formats
OmniMedVQA supports a broad spectrum of clinically-relevant VQA question formulations, systematically covering:
- Diagnosis: Selection among disease or pathology labels.
- Presence/Absence: Binary queries (e.g., “Is there hemorrhage present?”).
- Anatomical Localization: Selection of region or organ from a fixed set.
- Additional categories: Modality recognition, anatomy identification, lesion grading, and other biological attributes are included in the full corpus.
Answer formats are partitioned as follows:
- Closed-set class label: For diagnosis/localization; evaluated by exact match.
- Binary yes/no: For presence queries.
- Short free-form text: For "other" or open-ended categories (less frequent).
- Multiple-choice format: Canonical zero-shot evaluations supply 3–5 distractor options per question (Hu et al., 2024, Liu et al., 15 Jul 2025).
- Open-ended adaptation: Recent works adapt the corpus to permit unconstrained answer generation, mapping outputs to choices via embedding similarity (Alsinglawi et al., 8 Apr 2025).
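The embedding-similarity mapping used in the open-ended adaptation can be sketched as follows. The bag-of-words embedding here is a deliberately simple stand-in for the sentence encoder used in the cited work, which is not specified here:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would use a
    learned text encoder (e.g. a CLIP or SBERT text tower) instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_to_choice(generated: str, options: list[str]) -> str:
    """Return the answer option whose embedding is closest to the
    model's free-form generated answer."""
    g = embed(generated)
    return max(options, key=lambda o: cosine(g, embed(o)))

options = ["pneumonia", "pleural effusion", "no finding"]
print(map_to_choice("the scan shows signs of pneumonia", options))  # → pneumonia
```

This reduction lets generative models be scored with the same exact-match accuracy as multiple-choice models, at the cost of depending on the encoder's notion of similarity.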
The per-modality and per-category label distributions mirror real-world case prevalences, with MRI dominating (≈36%), and OCT and Fundus forming ≈6% each in most experimental configurations (Pramana et al., 22 Dec 2025).
3. Evaluation Protocols and Metrics
OmniMedVQA supports several evaluation protocols compatible with classification-style, multi-choice, and generative VQA architectures:
- Accuracy (ACC): Standard for both closed and open-ended sub-tasks, requiring an exact match to the ground truth (Alsinglawi et al., 8 Apr 2025, Liu et al., 15 Jul 2025).
- Question-Answering Score (QA-Acc): For multi-choice tasks in the canonical zero-shot setting, the fraction of questions whose generated answer matches the ground-truth option: $\text{QA-Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{a}_i = a_i]$.
- Prefix-Based Score (PB-Acc): For each candidate option $o_j$ and question $q$ over image $I$, the model scores the option by its log-likelihood, $s_j = \log P_\theta(o_j \mid q, I)$. The answer with maximal log-likelihood, $\hat{a} = \arg\max_j s_j$, is selected.
Notably, evaluations built on OmniMedVQA increasingly demand robustness metrics:
- Robust Accuracy under Attack: Measures ACC on adversarially perturbed images.
- Area Under the Accuracy-Under-Attack Curve (AUA): Tracks accuracy over a range of attack strengths (Pramana et al., 22 Dec 2025).
- Certified Robustness via Randomized Smoothing: the certified radius is $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$, where $p_A$, $p_B$ are estimated probabilities for the top answer and runner-up under Gaussian-noise smoothed inference with noise level $\sigma$, and $\Phi^{-1}$ is the standard normal inverse CDF.
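A minimal computation of the smoothing certificate, assuming the standard randomized-smoothing form of the radius (the source does not spell out the exact estimator used for $p_A$ and $p_B$):

```python
from statistics import NormalDist

def certified_radius(p_a: float, p_b: float, sigma: float) -> float:
    """Certified L2 radius R = (sigma/2) * (Phi^-1(p_a) - Phi^-1(p_b)),
    where p_a and p_b bound the top-1 and runner-up answer probabilities
    under Gaussian-smoothed inference with noise level sigma."""
    phi_inv = NormalDist().inv_cdf  # standard normal inverse CDF
    return 0.5 * sigma * (phi_inv(p_a) - phi_inv(p_b))

# e.g. a confident smoothed classifier with sigma = 0.25
r = certified_radius(p_a=0.9, p_b=0.05, sigma=0.25)
print(round(r, 3))  # → 0.366
```

When $p_A = p_B$ the radius is zero: the smoothed classifier cannot certify anything at inputs where the top two answers are tied.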
4. Baseline and State-of-the-Art Model Performance
Multiple published baselines illuminate both the strengths and persistent limitations of diverse VQA methods evaluated on OmniMedVQA.
| Model (General Domain) | QA-Acc (%) | PB-Acc (%) | Closed-Ended ACC (%) |
|---|---|---|---|
| BLIP-2 | 50.69 | 33.43 | 48.12 |
| InstructBLIP | 42.49 | 28.71 | 40.4 |
| LLaMA-Adapter-v2 | 33.15 | 29.88 | N/A |
| LLaVA | 27.85 | 20.02 | N/A |
| Model (Medical Specialized) | QA-Acc (%) | PB-Acc (%) | Closed-Ended ACC (%) |
|---|---|---|---|
| MedVInT | 41.50 | 25.81 | — |
| LLaVA-Med | 28.78 | 24.06 | — |
| RadFM | 26.82 | 29.00 | 26.99 |
Resource-efficient state-of-the-art methods include BiomedCLIP+LLaMA-3-8B ("LLama-CLIP"), which reaches 73.4% ACC on open-ended questions and 76.9% on yes/no closed-ended questions using only two A100-40GB GPUs (Alsinglawi et al., 8 Apr 2025). Larger models (Huatuo-GPT-Vision-32B, Lingshu-32B) achieve up to 76.6% in zero-shot multi-choice settings (Liu et al., 15 Jul 2025).
Adversarial evaluation reveals:
- Fine-tuned Qwen3-VL-4B ("Instruct" style): 95.43% clean ACC collapses to 26.07% under strong PGD attack.
- SafeMed-R1: Combines adversarial training with randomized smoothing, reaching 84.45% robust accuracy (a roughly 59-point improvement over standard fine-tuning) (Pramana et al., 22 Dec 2025).
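The PGD evaluation underlying these robust-accuracy numbers can be illustrated on a toy differentiable model. This is a generic ℓ∞ PGD loop against a logistic classifier in pure Python, not the attack configuration or model from the cited paper:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def pgd_linf(x, w, y, eps=0.1, alpha=0.02, steps=10):
    """Untargeted L-inf PGD against a logistic model p(y=1|x) = sigmoid(w.x).
    The cross-entropy gradient w.r.t. x is (p - y) * w; each step moves
    alpha in the sign direction, then projects back into the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))
        grad = [(p - y) * wi for wi in w]
        x_adv = [xi + alpha * (1 if g > 0 else -1 if g < 0 else 0)
                 for xi, g in zip(x_adv, grad)]
        # project each coordinate onto the eps-ball around the clean input
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv

x, w, y = [1.0, -0.5, 0.3], [0.8, -1.2, 0.5], 1
x_adv = pgd_linf(x, w, y)
clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))
print(adv < clean)  # → True: the attack lowers the true-class probability
```

Robust accuracy is then simply clean accuracy recomputed on `x_adv` across the test set; sweeping `eps` produces the accuracy-under-attack curve that AUA integrates.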
5. Robustness, Reasoning, and Failure Modes
The benchmark exposes model vulnerabilities to adversarial perturbations and highlights reasoning challenges:
- Adversarial Fragility: Standard VLMs lose roughly 70 percentage points of accuracy under norm-bounded PGD perturbations, while robust training (AT-GRPO + smoothing) recovers most of the lost accuracy.
- Reasoning vs. Understanding: Analysis partitions questions into “understanding” (∼40%) and “reasoning” (∼60%). Large, domain-tuned VLMs show up to 0.75 reasoning accuracy, with some models closing the gap between understanding and reasoning more effectively than generalist models (Liu et al., 15 Jul 2025).
- Modality-Specific Gaps: High-frequency texture modalities (e.g., Fundus, OCT) are especially sensitive to adversarial noise in non-robust models. MRI remains hardest for open-ended systems, with the lowest accuracy (e.g., 69.2% for LLama-CLIP) (Alsinglawi et al., 8 Apr 2025, Pramana et al., 22 Dec 2025).
- Interpretability and Robustness: Chain-of-thought reasoning traces improve adversarial recoverability, as demonstrated by SafeMed-R1’s “Think” model outperforming “Instruct” in perturbed regimes (Pramana et al., 22 Dec 2025).
6. Benchmark Insights, Limitations, and Recommendations
Key conclusions and open challenges highlighted by published analyses:
- General-Purpose Superiority: Large general-domain VLMs with strong pre-training (e.g., BLIP-2) regularly outperform current medical-specialized systems, especially for modality or anatomy queries akin to natural images (Hu et al., 2024).
- Alignment Gaps: Medical LVLMs struggle with rare modalities and subtle pathological cues due to insufficient domain-specific pretraining and limited instruction tuning.
- Overfitting and Dataset Uniformity: Repetitive QA pairs and templates can lead to memorization in smaller models, undermining real clinical robustness (Alsinglawi et al., 8 Apr 2025).
- Clinical Reliability Threshold: Even top-performing systems err on ∼23% of cases; this precludes clinical deployment and motivates continued research in multimodal alignment, reasoning-aware training, and more diverse supervision (Liu et al., 15 Jul 2025).
- Recommendations: Scale high-quality medical image–text corpora; integrate multi-task alignment objectives (segmentation, detection, reporting); conduct more detailed ablations and modality-specific tuning (Hu et al., 2024).
7. Significance and Future Directions
OmniMedVQA has established itself as a reference benchmark for medical LVLM/VQA progress, supporting:
- Unified, multi-modal evaluation across radiology, histopathology, ophthalmology, dermatology, and other subfields.
- Systematic adversarial robustness assessment, now a critical requirement for medical AI trustworthiness.
- Direct study of medical reasoning, interpretability, and modality-specific challenges.
- Comparative benchmarking for both resource-intensive and lightweight clinical AI architectures.
Future work is motivated by pipeline enhancements—richer medical image–text corpora for pretraining, explicit incorporation of multi-step reasoning training, open-ended narrative generation, and certified defenses against adversarial manipulations to close the safety gap for real-world deployments (Pramana et al., 22 Dec 2025, Hu et al., 2024).