
OmniMedVQA Benchmark for Medical VQA

Updated 29 December 2025
  • OmniMedVQA is a comprehensive benchmark featuring over 118,000 images paired with diverse clinical QA items across 12 imaging modalities.
  • It employs rigorous dataset construction with LLM-enhanced, template-based question generation and expert human verification to ensure clinical validity.
  • Benchmark evaluations reveal model vulnerabilities under adversarial settings, emphasizing the need for robust and reasoning-aware medical AI systems.

OmniMedVQA is a large-scale, multi-modality benchmark specifically curated for the rigorous evaluation of medical large vision-language models (LVLMs/VLMs) on medical Visual Question Answering (VQA). It is designed to stress-test reasoning, understanding, and robustness across a diverse spectrum of real-world medical imaging modalities and clinical query types, supporting both classification-style and open-ended VQA evaluation. Its central role in the medical VQA landscape is underscored by its adoption in benchmarking and robustness research, including recent works on adversarial defenses, resource-efficient architectures, and comparative competency studies (Pramana et al., 22 Dec 2025, Alsinglawi et al., 8 Apr 2025, Liu et al., 15 Jul 2025, Hu et al., 2024).

1. Benchmark Construction and Dataset Composition

OmniMedVQA is constructed from a composite of 73 medical datasets, including both fully open-access and restricted-access sources, and encompasses 12 imaging modalities: Computed Tomography (CT), Magnetic Resonance Imaging (MR/MRI), X-Ray, Ultrasound (US), Dermoscopy (Der), Endoscopy (End), Fundus Photography (FP), Microscopy (Mic), Optical Coherence Tomography (OCT), Digital Photography (DP), Infrared Reflectance Imaging (IRI), and Colposcopy (Co) (Hu et al., 2024).

  • Coverage: Over 20 anatomical regions, including brain, spine, lung, liver, kidney, eye, and musculoskeletal sites.
  • QA Construction: Each image is paired with up to five question types, provided through template-based generation refined for linguistic diversity via LLMs (GPT-3.5), and finalized with a human verification step to ensure clinical validity and balanced sampling.
  • Scale: The canonical release reports 118,010 images and 127,995 QA items (Hu et al., 2024). Typical experimental splits, such as those in SafeMed-R1 and LLama-CLIP works, use subsets in the 80,000–89,000 sample range spread across eight principal modalities (Pramana et al., 22 Dec 2025, Alsinglawi et al., 8 Apr 2025).
| Modality | Images (Canonical) | Typical Experiment Split (Train / Test) |
|---|---|---|
| MRI | 31,917 | 25,507 / 6,370 |
| CT | 14,457 | 12,567 / 3,241 |
| Ultrasound | 10,855 | 8,917 / 2,074 |
| Fundus Photo | 10,108 | 4,300 / 1,098 |
| Microscopy | 19,785 | 4,570 / 1,110 |
| Dermoscopy | 5,967 | 5,373 / 1,306 |
| OCT | 3,791 | 3,798 / 848 |
| X-Ray | 7,594 | 6,301 / 1,615 |

2. Taxonomy of Question Types and Answer Formats

OmniMedVQA supports a broad spectrum of clinically-relevant VQA question formulations, systematically covering:

  • Diagnosis: Selection among disease or pathology labels.
  • Presence/Absence: Binary queries (e.g., “Is there hemorrhage present?”).
  • Anatomical Localization: Selection of region or organ from a fixed set.
  • Modality Recognition, Anatomy Identification, Lesion Grading, and Other Biological Attributes are included in the full corpus.

Answer formats are partitioned as follows:

  • Closed-set class label: For diagnosis/localization; evaluated by exact match.
  • Binary yes/no: For presence queries.
  • Short free-form text: For "other" or open-ended categories (less frequent).
  • Multiple-choice format: Canonical zero-shot evaluations supply 3–5 distractor options per question (Hu et al., 2024, Liu et al., 15 Jul 2025).
  • Open-ended adaptation: Recent works adapt the corpus to permit unconstrained answer generation, mapping outputs to choices via embedding similarity (Alsinglawi et al., 8 Apr 2025).
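The open-ended adaptation above maps a model's free-form answer back onto the fixed option set by nearest-neighbor search in an embedding space. A minimal sketch, using a toy bag-of-words embedding as a stand-in for the sentence encoder a real pipeline would use (the function names here are illustrative, not from the cited works):

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding; a real pipeline would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_to_choice(generated, options):
    """Map an unconstrained model answer to the most similar candidate option."""
    g = embed(generated)
    return max(options, key=lambda o: cosine(g, embed(o)))

print(map_to_choice("the scan shows a brain tumor",
                    ["brain tumor", "lung nodule", "liver cyst"]))  # "brain tumor"
```

With a proper encoder the same argmax-over-similarity logic applies; only `embed` changes.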

The per-modality and per-category label distributions mirror real-world case prevalences, with MRI dominating (≈36%), and OCT and Fundus forming ≈6% each in most experimental configurations (Pramana et al., 22 Dec 2025).

3. Evaluation Protocols and Metrics

OmniMedVQA supports several evaluation protocols compatible with classification-style, multi-choice, and generative VQA architectures:

  • Accuracy (ACC):

\mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of QA Pairs}} = \frac{TP + TN}{TP + TN + FP + FN}

This metric is standard for both closed and open-ended sub-tasks, requiring exact match to ground truth (Alsinglawi et al., 8 Apr 2025, Liu et al., 15 Jul 2025).
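Since the metric requires exact match to ground truth, its computation reduces to a one-line comparison over the QA pairs (a minimal sketch, with illustrative labels):

```python
def exact_match_accuracy(predictions, ground_truth):
    """Fraction of QA pairs whose prediction exactly matches the reference."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Two of three predictions match the reference labels.
print(exact_match_accuracy(["CT", "MRI", "X-Ray"], ["CT", "MRI", "OCT"]))
```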

  • Question-Answering Score (QA-Acc): For multi-choice tasks in canonical zero-shot setting:

\text{QA-Acc} = \frac{\#\{\text{predictions} = \text{ground-truth}\}}{\#\{\text{questions}\}}

  • Prefix-Based Score (PB-Acc): For each candidate option $a_i$ and question $q$:

\ell(q, a_i) = \sum_{t=1}^{T} \log P\big(w_t^{(i)} \mid w_{<t}^{(i)}, \text{image}, q\big)

The answer with maximal log-likelihood is selected.
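The selection rule above can be sketched directly: sum each option's per-token conditional log-probabilities and take the argmax. The table of token log-probabilities below is a hypothetical stand-in for what a real VLM would return conditioned on (image, question, answer prefix):

```python
import math

# Hypothetical per-token conditional log-probabilities that a real VLM would
# produce for each candidate answer, conditioned on (image, question, prefix).
TOKEN_LOGPROBS = {
    "pneumonia":    [math.log(0.6), math.log(0.5)],
    "normal lungs": [math.log(0.2), math.log(0.4)],
    "pleural cyst": [math.log(0.1), math.log(0.3)],
}

def prefix_score(option):
    """Summed log-likelihood ℓ(q, a_i) of one candidate answer's tokens."""
    return sum(TOKEN_LOGPROBS[option])

def pick_answer(options):
    """Select the option with maximal summed log-likelihood."""
    return max(options, key=prefix_score)

print(pick_answer(list(TOKEN_LOGPROBS)))  # "pneumonia"
```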

Notably, OmniMedVQA benchmarks increasingly demand robustness evaluation:

  • Robust Accuracy under Attack: Measures ACC on adversarially perturbed images.
  • Area Under the Accuracy-Under-Attack Curve (AUA): Tracks accuracy over a range of attack strengths (Pramana et al., 22 Dec 2025).
  • Certified Robustness via Randomized Smoothing:

R_\text{cert} = \frac{\sigma}{2}\big[\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\big]

where $p_A$ and $p_B$ are the estimated probabilities of the top answer and the runner-up under Gaussian-noise smoothed inference, $\Phi^{-1}$ is the standard Gaussian inverse CDF, and $\sigma$ is the smoothing noise standard deviation.

4. Baseline and State-of-the-Art Model Performance

Multiple published baselines illuminate both the strengths and persistent limitations of diverse VQA methods evaluated on OmniMedVQA.

| Model (General Domain) | QA-Acc (%) | PB-Acc (%) | Closed-Ended ACC (%) |
|---|---|---|---|
| BLIP-2 | 50.69 | 33.43 | 48.12 |
| InstructBLIP | 42.49 | 28.71 | 40.4 |
| LLaMA-Adapter-v2 | 33.15 | 29.88 | N/A |
| LLaVA | 27.85 | 20.02 | N/A |

| Model (Medical Specialized) | QA-Acc (%) | PB-Acc (%) | Closed-Ended ACC (%) |
|---|---|---|---|
| MedVInT | 41.50 | 25.81 | — |
| LLaVA-Med | 28.78 | 24.06 | — |
| RadFM | 26.82 | 29.00 | 26.99 |

Resource-efficient state-of-the-art methods include BiomedCLIP+LLaMA-3-8B (“LLama-CLIP”): ACC 73.4% (open-ended), 76.9% (yes/no closed-ended), using only two A100-40GB GPUs (Alsinglawi et al., 8 Apr 2025). Larger models (Huatuo-GPT-Vision-32B, Lingshu-32B) achieve up to 76.6% in zero-shot multi-choice settings (Liu et al., 15 Jul 2025).

Adversarial evaluation reveals:

  • Fine-tuned Qwen3-VL-4B ("Instruct" style): 95.43% clean ACC collapses to 26.07% under a strong $L_2$ PGD attack.
  • SafeMed-R1: Adversarially trained and randomized smoothing—robust accuracy at 84.45% (a 59-point improvement over standard fine-tuning) (Pramana et al., 22 Dec 2025).
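The attack behind these numbers, $L_2$ PGD, repeatedly takes a normalized gradient-ascent step on the loss and projects the perturbation back onto the $L_2$ ball of radius $\varepsilon$. A minimal sketch on a toy differentiable loss (a real attack would backpropagate through the VLM's answer log-likelihood; the gradient function here is a stand-in):

```python
import numpy as np

def l2_pgd(x, grad_fn, epsilon=1.0, alpha=0.25, steps=10):
    """L2-constrained PGD: ascend the loss along the normalized gradient,
    then project the perturbation back onto ||δ||₂ ≤ ε after every step."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        g_norm = np.linalg.norm(g) + 1e-12
        x_adv = x_adv + alpha * g / g_norm      # normalized ascent step
        delta = x_adv - x
        d_norm = np.linalg.norm(delta)
        if d_norm > epsilon:                    # projection onto the L2 ball
            delta = delta * (epsilon / d_norm)
        x_adv = x + delta
    return x_adv

# Toy stand-in for the model's loss gradient: a fixed linear direction,
# i.e. ∇ₓ(w·x) = w for a linear loss.
w = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
x_adv = l2_pgd(x, lambda z: w, epsilon=1.0)

print(np.linalg.norm(x_adv - x) <= 1.0 + 1e-6)  # True: perturbation stays in the ball
```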

5. Robustness, Reasoning, and Failure Modes

The benchmark exposes model vulnerabilities to adversarial perturbations and highlights reasoning challenges:

  • Adversarial Fragility: Standard VLMs lose 70% accuracy under $L_2$-PGD perturbations, while robust training (AT-GRPO + smoothing) restores most robust accuracy.
  • Reasoning vs. Understanding: Analysis partitions questions into “understanding” (∼40%) and “reasoning” (∼60%). Large, domain-tuned VLMs show up to 0.75 reasoning accuracy, with some models closing the gap between understanding and reasoning more effectively than generalist models (Liu et al., 15 Jul 2025).
  • Modality-Specific Gaps: High-frequency texture modalities (e.g., Fundus, OCT) are especially sensitive to adversarial noise in non-robust models. MRI remains hardest for open-ended systems, with the lowest accuracy (e.g., 69.2% for LLama-CLIP) (Alsinglawi et al., 8 Apr 2025, Pramana et al., 22 Dec 2025).
  • Interpretability and Robustness: Chain-of-thought reasoning traces improve adversarial recoverability, as demonstrated by SafeMed-R1’s “Think” model outperforming “Instruct” in perturbed regimes (Pramana et al., 22 Dec 2025).

6. Benchmark Insights, Limitations, and Recommendations

Key conclusions and open challenges highlighted by published analyses:

  • General-Purpose Superiority: Large general-domain VLMs with strong pre-training (e.g., BLIP-2) regularly outperform current medical-specialized systems, especially for modality or anatomy queries akin to natural images (Hu et al., 2024).
  • Alignment Gaps: Medical LVLMs struggle with rare modalities and subtle pathological cues due to insufficient domain-specific pretraining and limited instruction tuning.
  • Overfitting and Dataset Uniformity: Repetitive QA pairs and templates can lead to memorization in smaller models, undermining real clinical robustness (Alsinglawi et al., 8 Apr 2025).
  • Clinical Reliability Threshold: Even top-performing systems err on ∼23% of cases; this precludes clinical deployment and motivates continued research in multimodal alignment, reasoning-aware training, and more diverse supervision (Liu et al., 15 Jul 2025).
  • Recommendations: Scale high-quality medical image–text corpora; integrate multi-task alignment objectives (segmentation, detection, reporting); conduct more detailed ablations and modality-specific tuning (Hu et al., 2024).

7. Significance and Future Directions

OmniMedVQA has established itself as a reference benchmark for medical LVLM/VQA progress, supporting:

  • Unified, multi-modal evaluation across radiology, histopathology, ophthalmology, dermatology, and other subfields.
  • Systematic adversarial robustness assessment, now a critical requirement for medical AI trustworthiness.
  • Direct study of medical reasoning, interpretability, and modality-specific challenges.
  • Comparative benchmarking for both resource-intensive and lightweight clinical AI architectures.

Future work is motivated by pipeline enhancements—richer medical image–text corpora for pretraining, explicit incorporation of multi-step reasoning training, open-ended narrative generation, and certified defenses against adversarial manipulations to close the safety gap for real-world deployments (Pramana et al., 22 Dec 2025, Hu et al., 2024).
