MultiPathQA—ExpertVQA Subset
- The paper establishes a gold standard for expert-level visual question answering on gigapixel pathology images through pathologist-curated, multiple-choice assessments.
- It employs a rigorous multi-resolution annotation pipeline and consensus quality assurance from expert pathologists to ensure clinical and diagnostic relevance.
- Benchmarking reveals that GIANT-powered GPT-5 significantly outperforms specialist models, while absolute accuracy remains modest, underscoring the difficulty of expert diagnostic VQA.
MultiPathQA—ExpertVQA Subset is a benchmark for evaluating visual question answering (VQA) capabilities on gigapixel pathology images at an expert clinical level. Curated as a subset within the broader MultiPathQA framework, ExpertVQA operationalizes rigorous, pathologist-authored assessment of model competence in direct slide interpretation—encompassing both spatial feature localization and challenging diagnostic classification. By leveraging full-resolution whole-slide images (WSIs) and multiple-choice items grounded in real clinical diagnostic workflows, the ExpertVQA subset establishes a gold standard for high-difficulty pathology VQA, revealing the precise limits and emergent proficiencies of large multimodal models (LMMs) and specialized domain models (Buckley et al., 24 Nov 2025).
1. Construction Pipeline and Data Sources
The ExpertVQA subset was assembled from the TCGA Uniform cohort, comprising 8,736 WSIs annotated over 32 cancer types. The selection process started with stratified sampling of the 20 most frequent primary sites (one slide per patient); from this pool, 90 WSIs were preliminarily chosen. Two pathologists—a resident and an attending—further filtered these to 76 diagnostically and technically high-quality slides on which to author questions.
Each WSI is delivered as a multiresolution tile pyramid aligned with clinical digital pathology standards, up to 40× magnification. For baseline methods, a 1,024×1,024 pixel thumbnail overview is also available. GIANT, the agentic navigation framework, dynamically chooses regions of interest, cropping each at the maximal available resolution at which its long side does not exceed a fixed pixel budget, without fixed tiling or overlap constraints, mirroring pathologist interaction with digital slides (Buckley et al., 24 Nov 2025).
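To make the level-selection logic concrete, here is a minimal sketch of choosing the highest-resolution pyramid level at which a requested region's long side fits a pixel budget. The function name, the example downsample factors, and the 1,024 px budget are illustrative assumptions, not values specified by the paper.

```python
def pick_level(region_w0, region_h0, downsamples, max_long_side=1024):
    """Return the smallest downsample factor (highest resolution) at which
    the region's long side, measured in level pixels, fits the budget.

    region_w0, region_h0: region size in level-0 (full-resolution) pixels.
    downsamples: per-level downsample factors, e.g. [1, 4, 16, 64].
    """
    long_side0 = max(region_w0, region_h0)
    for ds in sorted(downsamples):
        if long_side0 / ds <= max_long_side:
            return ds
    return max(downsamples)  # fall back to the coarsest level

# Example: a 20,000 px-wide region exceeds the budget at 1x, 4x, and 16x
# downsampling, so the 64x level is chosen from [1, 4, 16, 64].
```

This mirrors how a pathologist would zoom out just far enough to keep a structure of interest fully in view before inspecting it.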
2. Annotation Methodology and Quality Assurance
All 128 multiple-choice questions in ExpertVQA were authored directly by the two pathologists. Each question required pan-and-zoom review of the WSI before composition. Questions were structured to map closely to authentic diagnostic scenarios, either pinpointing a histopathological feature (“Region-of-Interest”) or rendering a precise diagnosis (“Diagnostic Classification”).
Consensus quality assurance was central. After initial drafting, each author independently reviewed all answers, and disagreements were resolved by joint re-examination of the slides. Quality was thus secured through expert domain knowledge and collaborative consensus; however, no quantitative inter-rater reliability statistic (e.g., Cohen’s κ) was reported. The validity of each item was grounded in clinical plausibility and unambiguous visual evidence.
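Although no inter-rater statistic was reported, Cohen's κ for the two annotators would be straightforward to compute from their independent pre-consensus answers. A minimal sketch of the standard two-rater formula (label lists are hypothetical inputs, not released data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```

Reporting κ alongside the consensus process would let future releases quantify how often the joint re-examination step was actually needed.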
3. Question Types and Design Features
ExpertVQA questions are exclusively multiple-choice (four options), spanning 20 diagnostic categories (e.g., discrimination between lung adenocarcinoma and squamous cell carcinoma; glioblastoma and astrocytoma). The two principal question formats are:
- Region-of-Interest (ROI) Localization: Requires identification of a histological hallmark or tissue feature in a specified region of the slide.
- Diagnostic Classification: Requires selection of the correct diagnostic or subtype label based on cue regions or whole-slide findings.
Representative examples include:
- ROI: “Inspect the entire slide at variable magnifications. Which region contains extensive tumor necrosis?” Options (A–D) reference positions or tissue compartments; answer is a letter.
- Classification: “Based on high-power morphology, what is the most likely tumor subtype?” Options are precise cancer subtypes; answer is a letter.
Each question references the full WSI in its prompt, and all correct labels are pathologist-verified (Buckley et al., 24 Nov 2025).
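A plausible data schema for such items, together with the letter-matching grading rule, can be sketched as follows. The class and field names are illustrative assumptions; the released benchmark may use a different serialization.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    """Hypothetical schema for one ExpertVQA item; field names are illustrative."""
    slide_id: str
    question: str
    options: dict   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str     # pathologist-verified correct letter
    qtype: str      # "ROI" or "Diagnostic Classification"

def grade(items, predicted_letters):
    """Categorical accuracy over parallel lists of items and predicted letters."""
    correct = sum(p == item.answer for item, p in zip(items, predicted_letters))
    return correct / len(items)
```

Because every item resolves to a single letter, grading reduces to exact string comparison, with no need for the answer-normalization heuristics that open-ended VQA requires.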
4. Evaluation Metrics and Statistical Framework
The primary evaluation for ExpertVQA is categorical accuracy. For $N$ questions, if $c$ is the number answered correctly:

$$\text{Accuracy} = \frac{c}{N}$$
Because this is a single-label, multiple-choice task, per-question precision and recall coincide, and F1 is not reported. For completeness, class-level metrics for a class $k$ can be defined as:

$$\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad \text{Recall}_k = \frac{TP_k}{TP_k + FN_k}$$
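The class-level metrics mentioned above can be computed directly from true and predicted answer-concept labels; a minimal sketch, not taken from the paper:

```python
def per_class_precision_recall(y_true, y_pred):
    """Per-class precision and recall for a single-label task.

    Returns {class: (precision, recall)}; classes with no true positives
    and no predictions of that class get 0.0 by convention.
    """
    out = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        out[c] = (prec, rec)
    return out
```

Even though per-question precision and recall coincide, the aggregated per-class values can still diverge, e.g. when a model over-predicts a common diagnosis.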
Additional evaluation relies on pathologist review of reasoning traces: models' region selections ("zoom choices") and diagnostic rationales are rated for clinical coherence. GIANT-powered GPT-5 achieved 62.9% coherence for region selection and 38.6% for final rationale; these figures are reported as raw percentages rather than formalized metrics (Buckley et al., 24 Nov 2025).
5. Model Benchmarking and Comparative Results
Performance on the ExpertVQA subset highlights a significant gap between foundation LMMs empowered by agentic navigation and traditional specialist pathology models. Averaged over five runs, GPT-5 paired with GIANT reaches 62.5% mean accuracy (±4.4% bootstrap standard deviation over 1,000 replicates). By contrast:
- TITAN: 43.8% (±4.2%)
- SlideChat: 37.5% (±4.3%)
Single-run accuracy for GPT-5+GIANT is 57.0% (±4.5%). Other baselines include GPT-4o+GIANT at 40.6% (±4.5%) and Claude-4.5-Sonnet+GIANT at 49.2% (±4.4%). The non-overlapping bootstrap confidence intervals between GPT-5+GIANT and specialist baselines indicate statistical significance of the improvement.
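The bootstrap behind these intervals resamples per-question correctness with replacement; a minimal sketch (the replicate count, seed, and percentile convention are illustrative assumptions, since the paper's exact protocol is not restated here):

```python
import random

def bootstrap_accuracy(correct_flags, n_boot=1000, alpha=0.05, seed=0):
    """Nonparametric bootstrap of categorical accuracy.

    correct_flags: list of 0/1 per question.
    Returns (point_estimate, ci_low, ci_high) via the percentile method.
    """
    rng = random.Random(seed)
    n = len(correct_flags)
    accs = sorted(
        sum(rng.choice(correct_flags) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = accs[int((alpha / 2) * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct_flags) / n, lo, hi
```

With only 128 questions, the ±4-5% spread reported for every model is expected: the standard error of a proportion near 0.5 at n = 128 is roughly 4.4%.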
A comparison with open-ended and general-domain VQA further contextualizes difficulty: on PathVQA, best specialist models achieve 68.2% on yes/no, but only 2.9% exact match for open-ended clinical questions (F₁=24.0%), while general-domain VQA models attain >60% exact match (He et al., 2020). This suggests that expert-level diagnostic VQA remains an unsolved problem for both domain and generalist systems.
6. Design Criteria for “Expert-Level” Subsets
ExpertVQA departs from large-scale, synthetic, crowd-sourced VQA in both its item curation and difficulty calibration. Key selection criteria for "expert-level" questions include:
- Complex, board-style diagnostic reasoning requiring integration of clinical history, morphologic detail, and visual findings.
- Multi-step spatial reasoning necessitating hierarchical localization (e.g., identifying adjacent but distinct anatomical regions).
- Emphasis on rare or low-frequency answer concepts: questions are preferentially drawn from the long tail of answer distributions, as identified by inverse-frequency difficulty (He et al., 2020).
- Direct authorship and multi-pass verification by trained pathologists to exclude ambiguous, trivial, or non-inferential tasks.
These criteria distinguish ExpertVQA from earlier open-ended VQA collections, yielding a compact but diagnostically rich testbed capable of discriminating model performance at the high end of clinical competence.
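The inverse-frequency difficulty criterion above can be approximated with a simple score over the answer-concept distribution; this sketch is an interpretation of the criterion, not a procedure given in either cited paper:

```python
from collections import Counter

def inverse_frequency_difficulty(answers):
    """Difficulty proxy: rarer answer concepts score higher.

    Score for concept a is 1 / freq(a), normalized so the rarest
    concept scores 1.0. Input is the list of gold answer concepts.
    """
    counts = Counter(answers)
    raw = {a: 1.0 / c for a, c in counts.items()}
    m = max(raw.values())
    return {a: v / m for a, v in raw.items()}
```

Sampling items preferentially from high-scoring concepts pushes the benchmark toward the long tail, where memorized label priors help least.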
7. Limitations, Open Issues, and Future Directions
Several constraints are inherent to the current ExpertVQA release. The benchmark is modest in scope (128 questions, 76 slides), samples only the 20 most frequent TCGA cancer sites, and reports no measure of inter-rater reliability. All items use a multiple-choice format; performance on open-ended, free-text, and explicit region-retrieval questions remains unaddressed.
Pathologist review of reasoning traces reveals recurrent LMM failure modes: premature diagnostic anchoring, hallucination of unseen features, and incoherent synthesis of visual and clinical cues. Addressing these issues will likely require:
- Larger, multi-institutional and disease-diverse expert-annotated datasets.
- Expansion beyond multiple-choice to open-ended and spatial retrieval question types.
- Incorporation of formal localization accuracy metrics.
- Integration of domain-aware tissue feature detectors within the agentic modeling framework (Buckley et al., 24 Nov 2025).
A plausible implication is that high-fidelity VQA in digital pathology, especially at gigapixel WSI scale, will demand ongoing methodological innovation in agentic navigation, multimodal representation, and consensus-driven clinical annotation.