
Capabilities of GPT-5 on Multimodal Medical Reasoning

Published 11 Aug 2025 in cs.CL and cs.AI | (2508.08224v1)

Abstract: Recent advances in LLMs have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.

Summary

  • The paper reports substantial improvements in text and multimodal medical reasoning, with GPT-5 reaching 95.84% accuracy on MedQA and large gains on MedXpertQA.
  • The study employs a unified zero-shot chain-of-thought prompting method to integrate heterogeneous data sources, including clinical text, structured indicators, and medical images.
  • The paper highlights GPT-5’s super-human performance on multimodal tasks, surpassing pre-licensed human experts and setting new benchmarks for clinical decision support.

Evaluation of GPT-5 for Multimodal Medical Reasoning

Introduction

The paper presents a comprehensive evaluation of GPT-5 and its variants (GPT-5-mini, GPT-5-nano) on multimodal medical reasoning tasks, benchmarking their performance against GPT-4o-2024-11-20 and pre-licensed human experts. The study addresses the critical challenge of integrating heterogeneous medical data—textual narratives, structured indicators, and medical images—within a unified reasoning framework. The authors employ standardized zero-shot chain-of-thought (CoT) prompting across diverse datasets, including MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD, to isolate model improvements from prompt engineering or dataset idiosyncrasies.

Datasets and Evaluation Protocol

The evaluation spans both text-based and multimodal medical QA/VQA datasets:

  • MedQA: Multiple-choice questions from US, Mainland China, and Taiwan medical licensing exams.
  • MMLU-Medical: Subset of MMLU focused on medical knowledge and reasoning.
  • USMLE Self Assessment: Official practice questions for Steps 1, 2 CK, and 3.
  • MedXpertQA: Expert-level benchmark with text-only and multimodal subsets, the latter incorporating complex clinical images and patient records.
  • VQA-RAD: Radiology-focused VQA dataset with binary yes/no questions linked to curated clinical images.

The unified prompting protocol uses zero-shot CoT reasoning: the model generates an explicit step-by-step rationale followed by a discrete answer selection. For multimodal items, images are appended to the initial user message, enabling integrated vision-language reasoning.

Figure 1: A prompting design sample from MedXpertQA, illustrating the integration of clinical text and medical imaging in the input.

Results: Text-Based Medical Reasoning

GPT-5 demonstrates consistent and substantial improvements over GPT-4o and its own smaller variants across all text-based benchmarks. On MedQA (US 4-option), GPT-5 achieves 95.84% accuracy, a 4.80% absolute gain over GPT-4o. The most pronounced improvements are observed in MedXpertQA Text, with reasoning and understanding scores increasing by 26.33% and 25.30%, respectively. In MMLU medical subdomains, GPT-5 maintains near-ceiling performance (>91%), with incremental gains in high-baseline categories, indicating that the model's upgrades primarily benefit complex reasoning tasks rather than factual recall.

Results: USMLE Self Assessment

GPT-5 outperforms all baselines on USMLE Steps 1, 2, and 3, with the largest margin (+4.17%) on Step 2, which emphasizes clinical decision-making. The average score across steps is 95.22%, well above typical human passing thresholds on these practice exams.

Results: Multimodal Medical Reasoning

GPT-5 achieves dramatic improvements in multimodal reasoning, particularly on MedXpertQA MM, with reasoning and understanding gains of +29.62% and +36.18% over GPT-4o. This magnitude of improvement suggests enhanced cross-modal attention and alignment within the model architecture. Notably, GPT-5 surpasses pre-licensed human experts by +24.23% (reasoning) and +29.40% (understanding) on MedXpertQA MM, marking a shift from human-comparable to super-human performance.

A representative case from MedXpertQA MM demonstrates GPT-5's ability to synthesize clinical narratives, laboratory data, and imaging findings to recommend appropriate high-stakes interventions.

Figure 2: GPT-5 reasoning output and final answer for MedXpertQA: case MM-1993, showing stepwise integration of multimodal evidence and exclusion of incorrect options.

In contrast, GPT-5 scores slightly lower on VQA-RAD (70.92%) compared to GPT-5-mini (74.90%), possibly reflecting conservative reasoning calibration in the larger model for small-domain tasks.

Comparison with Human Experts

GPT-5 not only closes the performance gap with pre-licensed human experts but exceeds their scores by substantial margins in both text and multimodal settings. GPT-4o remains below human expert performance in most dimensions, underperforming by 5.03–15.90%. GPT-5's lead is most pronounced in multimodal reasoning, where its unified vision-language pipeline integrates textual and visual evidence more effectively than pre-licensed human experts under time-limited test conditions.

Discussion

The evaluation reveals several key findings:

  • Substantial Gains in Multimodal Reasoning: GPT-5's improvements are most pronounced in tasks requiring tight integration of image-derived and textual evidence, suggesting architectural or training enhancements in cross-modal attention.
  • Strength in Reasoning-Intensive Tasks: Chain-of-thought prompting synergizes with GPT-5's internal reasoning capacity, enabling more accurate multi-hop inference, especially in complex clinical scenarios.
  • Super-Human Benchmark Performance: GPT-5 consistently exceeds pre-licensed human expert performance in controlled QA/VQA evaluations, highlighting its potential for clinical decision support. However, these results are obtained under idealized testing conditions and may not fully capture the complexity and uncertainty of real-world medical practice.
  • Scaling-Related Calibration Effects: The slight underperformance of GPT-5 on VQA-RAD compared to GPT-5-mini suggests that larger models may adopt more cautious reasoning strategies in small-domain tasks, warranting further investigation into adaptive prompting and calibration techniques.

Implications and Future Directions

The demonstrated capabilities of GPT-5 have significant implications for the design of future clinical decision-support systems. Its proficiency in integrating complex multimodal information streams and delivering accurate, well-justified recommendations positions it as a reliable core component for medical AI applications. However, the transition from benchmark evaluations to real-world deployment necessitates further research into prospective clinical trials, domain-adapted fine-tuning, and robust calibration methods to ensure safety, transparency, and ethical compliance.

Conclusion

This study provides a rigorous, systematic evaluation of GPT-5's capabilities in multimodal medical reasoning, establishing its superiority over GPT-4o, smaller GPT-5 variants, and pre-licensed human experts across diverse QA and VQA benchmarks. The model's substantial gains in reasoning-intensive and multimodal tasks mark a qualitative shift in LLM capabilities, bridging the gap between research prototypes and practical clinical tools. Future work should focus on validating these results in real-world clinical environments and developing strategies for safe and effective deployment.


Explain it Like I'm 14

What this paper is about (in simple terms)

This paper tests a new AI model called GPT-5 to see how well it can “think like a doctor” using both words and pictures. In medicine, doctors don’t just read text; they also look at scans (like CTs), lab numbers, and patient stories. The authors ask: Can one general AI model understand all of that together and make good decisions without special medical training?

The main questions the researchers asked

  • Can GPT-5 answer tough medical questions correctly when it only gets a normal prompt (no extra training), including questions that need reading medical images?
  • Is GPT-5 better than older models (like GPT-4o) and smaller versions (GPT-5-mini and GPT-5-nano)?
  • On these test-style questions, does GPT-5 do as well as, or even better than, trained human experts?
  • Does GPT-5 especially improve at tasks that require step-by-step reasoning, not just memorizing facts?

How they tested the model (methods explained with easy analogies)

Think of this like giving different kinds of exams to the AI:

  • Text-only medical exams: Questions where the model reads a paragraph and picks A, B, C, etc.
  • Picture + text exams: Questions where the model reads a case and also looks at medical images (like radiology scans) before choosing an answer.

Here are the main “exams” they used:

  • MedQA and MMLU-Medical: Big sets of medical multiple-choice questions covering many topics.
  • USMLE sample exams: Practice questions from the U.S. medical licensing tests (Step 1, Step 2, Step 3).
  • MedXpertQA: Very hard, expert-level questions. There are two versions: text-only and multimodal (text + medical images and test results).
  • VQA-RAD: Yes/no questions about radiology images.

How they asked the questions:

  • Zero-shot: Like giving the AI a test without showing it example answers first. No special extra training on these test sets.
  • “Chain of thought” prompting: They asked the AI to “think step by step” before giving a final answer, similar to showing your work in math class.
  • Same rules for all models: They used the same prompts and setup for GPT-4o, GPT-5, and the smaller GPT-5 versions. This makes the comparison fair—any difference is from the model, not the instructions.

How multimodal questions worked:

  • The AI received the patient story plus the image(s) in the same message, so it could connect what it read with what it saw—like a doctor combining the chart and the scan.

What they found and why it matters

Big picture:

  • GPT-5 did better than GPT-4o and the smaller GPT-5 models on almost every benchmark.
  • It was especially strong on questions that require careful, multi-step reasoning and on tasks that combine text and images.

Highlights:

  • MedQA (text-only): GPT-5 scored about 96%, higher than GPT-4o. That suggests strong medical knowledge and reasoning.
  • USMLE samples: GPT-5 averaged about 95% across Steps 1–3, with the biggest jump on Step 2 (the step focused on clinical decision-making).
  • MedXpertQA Text (hard expert-level questions): GPT-5 gained more than 25 percentage points over GPT-4o on measures of reasoning and understanding.
  • MedXpertQA Multimodal (text + images): GPT-5 jumped roughly 30–36 percentage points over GPT-4o, a very large improvement. On these standardized tests, GPT-5 even scored higher than pre-licensed human experts by wide margins.
  • VQA-RAD (radiology yes/no): A small exception—GPT-5 was slightly below GPT-5-mini here. The authors suggest the smaller model might fit this small, focused dataset a bit better, or that GPT-5 was more cautious.

A real-case example (in simple words):

  • The paper shows a case where a patient had severe vomiting, chest/neck crackling (air under the skin), and certain scan findings. GPT-5 correctly reasoned that this likely meant a tear in the esophagus (a dangerous emergency) and chose the right next test (a water-soluble contrast swallow). It explained why the other choices were wrong. This shows it can connect image clues with the story and labs to reach a high-stakes, sensible decision.

Why this matters:

  • Medical work is multimodal—doctors constantly combine text, numbers, and images. GPT-5’s big gains with text+images suggest it’s getting better at the kind of integrated thinking doctors do.

What this could mean and what’s next

  • Potential impact: On these controlled, test-style benchmarks, GPT-5 moves from “similar to humans” to “above human expert scores” in several areas, especially when combining text and images. That hints it could be a strong helper for clinical decision support—like a smart assistant that reads the chart, examines the scan, and offers a well-reasoned suggestion.
  • Important caution: These are standardized tests, not real hospital situations. Real care involves messy, changing information, team communication, time pressure, and ethics. A model that aces exams still needs careful checks before being used with patients.
  • What to do next: The authors recommend future work like real-world trials, better calibration, and safety measures, so the model’s advice is reliable, transparent, and used responsibly.

In short: This study shows GPT-5 is noticeably better at medical reasoning—especially when it has to blend text and images—and, on several tough benchmarks, it even beats trained human experts. That’s exciting for future medical AI tools, but careful testing and safeguards are essential before clinical use.


