MedGaze-Bench: Gaze-Anchored Clinical Benchmark
- MedGaze-Bench is a gaze-anchored benchmark suite that evaluates Med-MLLMs by probing spatial, temporal, and standard clinical intents.
- It integrates clinician eye-tracking data from open surgery, emergency simulation, diagnostic radiology, and mammography to derive performance metrics.
- Its comprehensive framework reveals limitations in model procedural reasoning and safety compliance, highlighting issues like perceptual and cognitive hallucinations.
MedGaze-Bench is a rigorously validated, gaze-anchored benchmark suite designed to evaluate machine-learning models—particularly Medical Multimodal LLMs (Med-MLLMs)—in their capability to interpret egocentric clinical intent for real-world medical applications. MedGaze-Bench is grounded in the use of clinician gaze as a "Cognitive Cursor" to probe nuanced understanding across diverse modalities, including open surgery, emergency simulation, diagnostic radiology, and mammography. Its architecture uniquely addresses challenges arising from the visual homogeneity of anatomical structures, the temporally and causally complex nature of clinical workflows, and the implicit necessity of strict adherence to safety protocols. By framing evaluation across spatial, temporal, and protocol-compliance dimensions, MedGaze-Bench provides a comprehensive assessment of both general and adversarial (Trap QA) clinical reasoning capabilities, exposing critical deficiencies in even state-of-the-art Med-MLLMs (Liu et al., 11 Jan 2026).
1. Three-Dimensional Clinical Intent Framework
MedGaze-Bench assesses egocentric clinical intent understanding through a structured three-dimensional framework, partitioned into the following axes and sub-capabilities:
- Spatial Intent ("Where?")
- Absolute Localization: Identifying precise anatomical or instrumental targets at the clinician's fixation, robust to anatomical homogeneity.
- Relative Localization: Articulating the spatial relation of the fixation target to surrounding structures or tools.
- Temporal Intent ("Why/When?")
- Retrospective Attribution: Inferring prerequisite actions or procedural history via causal "backward" reasoning.
- Prospective Anticipation: Forecasting the next clinical operation or goal enabled by the current action, i.e., causal "forward" reasoning.
- Standard Intent ("How?")
- Discrete Safety Verification: Confirming compliance with critical standard safety checks (e.g., nerve integrity, instrument clearance) at specified workflow keypoints.
- Continuous Safety Vigilance: Verifying persistent maintenance of safety constraints (e.g., hemostasis, protection of vital structures) across extended frame sequences.
Each sub-capability is quantified by accuracy, and the overall score is defined as

$$\mathrm{Acc}_{\text{overall}} = \frac{1}{6}\sum_{i=1}^{6}\mathrm{Acc}_i,$$

where $i$ indexes the six enumerated sub-capabilities.
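Under this definition, the overall score is the unweighted mean of the six sub-capability accuracies. A minimal sketch (the sub-capability keys and accuracy values below are illustrative, not the benchmark's released code):

```python
# Macro-averaged overall score over the six MedGaze-Bench sub-capabilities.
SUB_CAPABILITIES = [
    "absolute_localization", "relative_localization",               # Spatial
    "retrospective_attribution", "prospective_anticipation",        # Temporal
    "discrete_safety_verification", "continuous_safety_vigilance",  # Standard
]

def overall_score(acc: dict[str, float]) -> float:
    """Unweighted mean of the six sub-capability accuracies."""
    missing = set(SUB_CAPABILITIES) - set(acc)
    if missing:
        raise ValueError(f"missing sub-capabilities: {missing}")
    return sum(acc[c] for c in SUB_CAPABILITIES) / len(SUB_CAPABILITIES)
```

Because the mean is unweighted, each sub-capability contributes equally regardless of how many MCQs probe it.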
2. Data Collection, Gaze Processing, and Source Modalities
MedGaze-Bench aggregates expert gaze data from four distinct sources:
- Open Surgery Videos: 20 cases over 10 procedure types performed by 8 surgeons.
- Emergency Simulation: Breech delivery scenarios in controlled, standardized simulations.
- Chest X-ray Interpretation: Specialist radiologist gaze trajectories (MIMIC-Eye).
- Mammography Interpretation: Breast radiologist fixations (Mammo-Gaze).
Gaze was recorded using head-mounted eye trackers at 120 Hz (video at 30 fps for surgery, 10 fps for simulation, static images for radiology). In the processing pipeline, outlier fixations (blinks, dropout) are discarded via median filtering, raw gaze coordinates are normalized to frame coordinates, and fixations are clustered using DBSCAN (minPts = 5; the neighborhood radius ε is not stated), defining Areas of Interest (AOIs). Statistical cluster analysis reveals a significant concentration (>70% of fixation time) around key anatomical sites during open procedures, capturing characteristic gaze and dwell-time transitions across Standard Operating Procedure (SOP) phases. Gaze location is encoded both as a semi-transparent overlay in video frames and as explicit textual metadata in task prompts.
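The normalization and clustering stages can be sketched as follows. This is a minimal pure-Python illustration, not the benchmark's implementation: the median-filtering step is omitted, and the neighborhood radius `eps` is an assumed placeholder (the source fixes only minPts = 5).

```python
import math

def normalize(gaze, width, height):
    """Map raw pixel gaze coordinates to [0, 1] frame coordinates."""
    return [(x / width, y / height) for x, y in gaze]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one cluster label per point (-1 = noise).

    Dense clusters of fixations define Areas of Interest (AOIs);
    sparse stray fixations fall out as noise."""
    n = len(points)
    labels = [None] * n
    neighbors = lambda i: [j for j in range(n)
                           if math.dist(points[i], points[j]) <= eps]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # noise point
            continue
        cluster += 1                  # i is a new core point
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point: absorb, don't expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: expand
                queue.extend(nbrs)
    return labels
```

Each resulting cluster of normalized fixations is one candidate AOI; its dwell time is the summed duration of its member fixations.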
3. Dataset Construction, Curation, and Task Formats
The dataset comprises 4,491 curated samples, with scenario and task breakdown as follows:
| Intent Dimension | Number of MCQs | Example Task |
|---|---|---|
| Spatial | 1,229 | “Which structure is beneath my current focus?” |
| Temporal | 2,028 | “What should I do next to ensure hemostasis?” |
| Standard | 1,234 | “Have I confirmed nerve preservation per guidelines?” |
| Trap QA (adversarial) | 600 | See Section 4 |
Tasks are represented in JSON format, containing scenario identifiers, video frames, gaze coordinates, intent dimension labels, subtypes, question prompts, answer options, and correct responses. Questions are generated via a “Gaze-Anchored Prompting” pipeline (GPT-4o with first-person narrative framing) and then filtered for clinical rigor through specialist review.
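A task record along these lines might look as follows. The field names, clinical content, and answer options are illustrative assumptions; the released benchmark's exact schema may differ.

```python
import json

# Illustrative MedGaze-Bench-style task record (hypothetical schema).
sample = {
    "scenario_id": "open_surgery_0001",
    "frames": ["frame_0412.png", "frame_0413.png"],
    "gaze": [{"frame": "frame_0412.png", "x": 0.47, "y": 0.62}],
    "intent_dimension": "spatial",
    "subtype": "absolute_localization",
    "question": "Which structure is beneath my current focus?",
    "options": {"A": "Common bile duct", "B": "Cystic artery",
                "C": "Hepatic vein", "D": "Gallbladder wall"},
    "answer": "A",
}
record = json.dumps(sample, indent=2)
```

Encoding gaze as normalized (x, y) metadata alongside the frame list lets the same record drive both the visual overlay and the textual gaze reference in the prompt.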
4. Trap QA: Adversarial Reliability and Hallucination Probes
Clinical reliability is interrogated through two classes of adversarial multiple-choice questions (“Trap QA”):
- Type I: Perceptual Hallucination – The prompt correctly describes visible content, but distractor answers reference objects/tools not present. Robust models must avoid fabricated answers.
- Type II: Cognitive Hallucination (Instruction Sycophancy) – The prompt contains a false premise; only one answer identifies this logical error (“Error Detection”). Safe models should select the error-detection response rather than justify impossible actions.
Metrics are reported as

$$\mathrm{Rel}_{\text{perc}} = \mathrm{Acc}(\text{Type I Trap QA}), \qquad \mathrm{Rel}_{\text{cog}} = \mathrm{Acc}(\text{Type II Trap QA}),$$

and

$$\mathrm{Rel}_{\text{avg}} = \tfrac{1}{2}\left(\mathrm{Rel}_{\text{perc}} + \mathrm{Rel}_{\text{cog}}\right)$$

to summarize reliability under adversarial settings.
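Assuming reliability is scored as plain accuracy on each Trap QA type and then averaged (an interpretation consistent with the reported figures, e.g., (62.00 + 77.67) / 2 ≈ 69.84), the scoring reduces to:

```python
def trap_accuracy(results):
    """Fraction of Trap QA items answered correctly.

    Each result is a (predicted_option, correct_option) pair;
    for Type II items the correct option is the error-detection choice."""
    if not results:
        return 0.0
    return sum(p == c for p, c in results) / len(results)

def reliability(type1_results, type2_results):
    """Perceptual (Type I), cognitive (Type II), and average reliability.

    The unweighted average of the two types is an assumption inferred
    from the reported numbers, not stated explicitly by the source."""
    rel_perc = trap_accuracy(type1_results)
    rel_cog = trap_accuracy(type2_results)
    return rel_perc, rel_cog, (rel_perc + rel_cog) / 2
```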
5. Evaluation Protocols and Experimental Benchmarks
The evaluation comprehensively covers proprietary, open-source, specialist, and medical-specific MLLMs in a pure zero-shot setting (no domain- or gaze-specific fine-tuning). Input comprises frame sequences, gaze overlays, and gaze-referencing textual prompts. Metrics include:
- Clinical Competency: per-dimension accuracy and the overall score $\mathrm{Acc}_{\text{overall}}$.
- Clinical Reliability: $\mathrm{Rel}_{\text{perc}}$, $\mathrm{Rel}_{\text{cog}}$, and $\mathrm{Rel}_{\text{avg}}$.
Model classes tested:
- Proprietary: GPT-5, Gemini 3 Pro.
- Open-source general: Qwen3VL 32B, Intern3.5VL 38B, Qwen3-VL 30B-A3B-Thinking, Qwen3-VL 235B-A22B-Instruct.
- Medical-specific: LingShu 32B, MedGemma 27B.
- Egocentric specialist: EgoLife 7B.
6. Key Experimental Insights
Major findings from the MedGaze-Bench evaluation include:
- Scale and Architecture Dependency: GPT-5 attains the highest $\mathrm{Acc}_{\text{overall}}$ at 62.28%, with the mixture-of-experts Qwen3-VL 235B-A22B surpassing the dense Intern3.5VL 38B despite fewer activated parameters (58.86% vs. 53.73%).
- Temporal Intent Asymmetry: Open-source models can predict next steps (~60% in Prospective Anticipation) but fail at reconstructing preceding workflow stages (~35% in Retrospective Attribution), revealing a deficit in backward causal reasoning.
- Domain Adaptation Limit: Medical knowledge-focused models (LingShu, MedGemma) do not exceed ~53% accuracy, indicating that declarative knowledge does not compensate for deficits in procedural logic.
- Egocentric Specialist Limitation: EgoLife 7B achieves only 45.86% overall accuracy and 19.67% cognitive reliability ($\mathrm{Rel}_{\text{cog}}$), displaying frequent rationalization of impossible tasks.
- Reliability Bottleneck: No model exceeds 70% average reliability ($\mathrm{Rel}_{\text{avg}}$). Intern3.5VL is the most perceptually reliable (69.82%) but highly susceptible to cognitive hallucinations (54.67%). Gemini 3 Pro reaches the highest $\mathrm{Rel}_{\text{avg}}$ (69.84%; 62.00% perceptual, 77.67% cognitive).
- Gaze Prompting Effect: Adding gaze overlays and metadata improves generalist model accuracy by 3–4% (especially standard intent), but yields marginal (<1%) or negative gains in domain-specific or egocentric specialist models.
Error analysis established that all tested systems rely excessively on global scene features or language priors, manifesting as both perceptual hallucinations (fabricating objects not present in the scene) and sycophantic cognitive hallucinations.
7. Limitations and Future Directions
Current coverage is limited to open surgery, emergency simulation, and selected radiology tasks. Expansion to further specialties (e.g., interventional cardiology, critical care, outpatient consultation) is an open direction. The exclusive use of multiple-choice questions, while analytically tractable, does not encompass the full spectrum of open-ended clinical reasoning. The present approach overlays gaze information as simple visual/textual prompts; integration into end-to-end architectures capable of sequence-level gaze reasoning is an anticipated advance. Future benchmarks may also include calibration metrics and interactive, real-time safety tasks (e.g., video interruption on safety protocol deviation) to deepen compliance assessment.
MedGaze-Bench thus constitutes a foundational testbed for intent-centric, gaze-informed evaluation of medical vision-LLMs, revealing key safety, causal, and procedural limitations in the current SOTA and delineating a roadmap for the development of clinically trustworthy artificial intelligence (Liu et al., 11 Jan 2026).