
MAQuE: Medical Agent Questioning Evaluation

Updated 22 February 2026
  • MAQuE is a benchmark that systematically evaluates multi-turn medical inquiry by simulating 3,000+ patient profiles across 21 specialties.
  • It employs granular assessment metrics, including atomic disclosure control and linguistic variation, to mimic realistic patient interactions.
  • The framework highlights trade-offs between diagnostic accuracy, inquiry efficiency, and empathetic communication in current LLM-based clinical agents.

MAQuE (Medical Agent Questioning Evaluation) is the largest benchmark for systematic, automatic, and fine-grained assessment of multi-turn medical inquiry in AI-powered doctor agents. Spanning 3,000+ simulated patient agents designed with high behavioral fidelity, MAQuE offers an empirically grounded, multi-dimensional evaluation framework that rigorously probes the inquiry, dialogue, and empathetic competence of contemporary LLM-based clinical agents across 21 medical specialties. Its comprehensive scope and detailed metric design reveal fundamental limitations, trade-offs, and practical challenges for real-world deployment of AI doctor agents (Gong et al., 29 Sep 2025).

1. Construction and Scope of the MAQuE Benchmark

MAQuE comprises approximately 3,000 simulated patient agents sourced and augmented from five principal corpora: MedQA (USMLE-style clinical questions; 1,257 cases), Craft-MD (online bank; 140 cases), DiagnosisArena (journal cases; 915 cases), AgentClinic-NEJM (92 cases), and the synthetic Patient-Zero set (420 cases). This aggregation yields 2,824 real and 420 synthetic cases, mapped to 21 medical specialties with roughly uniform distribution after augmentation.

Each patient profile is annotated with an average of 23.75 Atomic Information Units (AIUs)—indivisible, non-overlapping clinical facts—enabling granular control over what information is available for elicitation in each dialogue round. Patient agents are governed by three core design principles:

  • Atomic Disclosure Control: Restricts each turn to at most $k = 3$ disclosed AIUs, enforcing a realistic information bottleneck and preventing over-disclosure.
  • Linguistic Variation: Randomly assigns paraphrase styles (“colloquial,” “vague,” “direct”) to patient utterances.
  • Noise Injection: Emulates patient-side cognitive limitations (imprecise recall, misunderstanding), fluctuating emotional states (anxiety, frustration, 5-level intensity), and stochastic vague rewriting.

All behavioral variation is seeded deterministically per patient ID, ensuring full reproducibility and facilitating robust ablation studies.
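The deterministic, per-patient seeding of the disclosure cap can be sketched as follows. This is a minimal illustration, not MAQuE's actual implementation: the function name `disclose_aius` and the exact seeding scheme (patient ID combined with turn index) are assumptions; the benchmark only specifies that behavior is seeded deterministically per patient ID with at most k = 3 AIUs per turn.

```python
import random

def disclose_aius(patient_id: str, remaining_aius: list[str],
                  turn: int, k: int = 3) -> list[str]:
    """Sample at most k Atomic Information Units to disclose this turn.

    Seeding the RNG from the patient ID and turn index (an assumed scheme)
    makes every simulated dialogue fully reproducible.
    """
    rng = random.Random(f"{patient_id}:{turn}")  # deterministic per patient/turn
    return rng.sample(remaining_aius, min(k, len(remaining_aius)))
```

Because the seed depends only on the patient ID and turn, re-running a dialogue yields identical disclosures, which is what enables the ablation studies described above.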

2. Multi-Dimensional Evaluation Framework

MAQuE’s evaluation framework quantifies a doctor agent’s inquiry quality along five orthogonal axes:

2.1 Task Success (TS)

Diagnostic accuracy ($\mathrm{Acc}$) measures the fraction of correct final diagnoses, while specialty robustness ($S_\mathrm{robust}$) penalizes large disparities between specialties:

$$S_\mathrm{robust} = 1 - \frac{\sigma}{\max(\mu + \sigma,\ \epsilon)}, \quad \epsilon = 10^{-3}$$

The final score is $TS = (\mathrm{Acc} + S_\mathrm{robust})/2$.
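The Task Success score can be computed directly from per-specialty accuracies. A minimal sketch, assuming $\mu$ and $\sigma$ are the mean and population standard deviation of per-specialty accuracy (the paper's exact variance convention is an assumption):

```python
import statistics

def specialty_robustness(per_specialty_acc: list[float],
                         eps: float = 1e-3) -> float:
    """S_robust = 1 - sigma / max(mu + sigma, eps)."""
    mu = statistics.mean(per_specialty_acc)
    sigma = statistics.pstdev(per_specialty_acc)  # population std (assumed)
    return 1 - sigma / max(mu + sigma, eps)

def task_success(acc: float, per_specialty_acc: list[float]) -> float:
    """TS = (Acc + S_robust) / 2."""
    return (acc + specialty_robustness(per_specialty_acc)) / 2
```

Note that uniform per-specialty accuracy gives $\sigma = 0$ and hence $S_\mathrm{robust} = 1$, so the penalty applies only when specialties diverge.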

2.2 Inquiry Proficiency (IP)

Evaluates relevance and breadth of elicited AIUs:

  • Coverage: $|E|/|A|$, where $E$ is the set of elicited AIUs and $A$ the full set of target AIUs.
  • Relevance: Proportion of agent queries that meaningfully elicit a target AIU, averaged across all questions.
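The two Inquiry Proficiency metrics reduce to simple set arithmetic. A sketch under one plausible reading (each question is represented by the set of target AIUs it elicited; the function names are illustrative, not from the paper):

```python
def coverage(elicited: set[str], target: set[str]) -> float:
    """|E| / |A|: fraction of target AIUs the agent managed to elicit."""
    return len(elicited & target) / len(target) if target else 0.0

def relevance(aius_per_question: list[set[str]]) -> float:
    """Fraction of questions that elicited at least one target AIU,
    averaged over all questions in the session."""
    if not aius_per_question:
        return 0.0
    return sum(1 for hit in aius_per_question if hit) / len(aius_per_question)
```

With an average of 23.75 AIUs per profile and a 3-AIU disclosure cap per turn, full coverage requires at least eight well-targeted questions, which helps explain the low coverage numbers reported in Section 4.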

2.3 Dialogue Competence (DC)

  • Adherence: LLM-judge–scored (1–5, normalized) measure of role consistency—no premature diagnosis, no bulk-questioning, no explicit disclaimers.
  • Coherence: Normalized penalty for contradictions, repetition, and illogical conversational flow.

2.4 Inquiry Efficiency (IE)

  • Questions: Mean number of queries per session (lower is better).
  • Tokens: Cumulative token usage per session (in thousands; lower is better).

2.5 Patient Experience (PE)

  • Clarity: Are questions concise, unambiguous, and free of medical jargon? (LLM-judge score)
  • Empathy: Degree of demonstrable care, reassurance, and emotional resonance (LLM-judge score).

Wherever possible, MAQuE applies automated scoring; otherwise, an LLM-as-judge rubric is used, with scores normalized to $[0,1]$. LLM-judge evaluations were validated by strong Pearson correlations with human annotation: 0.916 (adherence), 0.846 (coherence), 0.664 (clarity), and 0.995 (empathy).

3. Dataset Creation, Annotation, and Agent Simulation

MAQuE’s patient profiles are generated by stringent pre-filtering of public diagnostic case sources, followed by department classification using GPT-4o prompting. AIU extraction is performed by LLMs guided by prompts that ensure atomistic, non-overlapping fact definition. Synthetic patient augmentation via Patient-Zero ensures representation across all 21 specialties, producing a distribution of ~3,000 cases as detailed in supplementary Figure A.2 (Gong et al., 29 Sep 2025).

Patient agent generation comprises three layers:

  1. Probabilistic disclosure of AIUs with a strict $k$-cap per round.
  2. Randomized linguistic style through paraphrase templates.
  3. Systematic noise modalities: cognitive distortion, emotional state modulation, and utterance vagueness—enforceable and reproducible at the agent level.
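The second and third layers, deterministic assignment of paraphrase style and emotional intensity, can be sketched as below. The hash-based derivation from the patient ID is an assumption used for illustration; MAQuE specifies only that all behavioral variation is seeded per patient ID.

```python
import hashlib
import random

STYLES = ["colloquial", "vague", "direct"]   # paraphrase styles from the benchmark

def patient_behavior(patient_id: str) -> dict:
    """Derive a patient's linguistic style and 5-level emotional intensity
    deterministically from its ID (illustrative seeding scheme)."""
    seed = int(hashlib.sha256(patient_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return {
        "style": rng.choice(STYLES),
        "emotion_intensity": rng.randint(1, 5),  # 5-level intensity
    }
```

Hashing the ID rather than using Python's built-in `hash` avoids per-process salt, so the same patient always exhibits the same behavior across runs and machines.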

4. Empirical Benchmarking of Doctor Agents

Performance on MAQuE reveals substantial gaps in current LLMs’ inquiry and dialogue capabilities. Baseline evaluations establish a lower bound: “chief-complaint only” (Acc = 0.404, Robust = 0.769) and an upper bound: “oracle full profile” (Acc = 0.852, Robust = 0.916). Prominent closed-source models show significant variability:

| Model | TS.Acc | Cov | Rel | Adh | Coh | #Q | #Tok (k) | Clar | Emp |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.692 | 0.374 | 0.890 | 0.962 | 0.821 | 9.63 | 0.18 | 0.792 | 0.522 |
| GPT-5-Chat | 0.684 | 0.302 | 0.919 | 0.991 | 0.828 | 8.66 | 0.19 | 0.703 | 0.458 |
| Gemini-2.5-Pro | 0.672 | 0.288 | 0.840 | 0.964 | 0.873 | 6.70 | 11.31 | 0.836 | 0.669 |
| Claude-Sonnet-4 | 0.662 | 0.385 | 0.947 | 0.886 | 0.888 | 9.67 | 0.48 | 0.785 | 0.774 |

Domain-specific and open-source LLMs (e.g., DeepSeek-V3: Acc = 0.555, Cov = 0.226; Baichuan-M2-32B: Acc = 0.578, Cov = 0.338) consistently underperform relative to closed-source LLMs both in diagnostic accuracy and AIU coverage.

Even after 20 dialogue rounds, GPT-4o achieves less than 50% coverage of a patient’s AIUs, demonstrating clear limitations in current inquiry practices.

5. Trade-Offs, Robustness, and Sensitivity to Patient Realism

MAQuE exposes persistent trade-offs between evaluation axes:

  • Accuracy-Efficiency: Higher accuracy generally requires increased questioning and greater token usage (e.g., GPT-4o vs. Gemini-2.5-Pro’s 11k tokens per session).
  • Empathy-Performance Decoupling: No systematic correlation between empathy and diagnostic success; Claude-Sonnet-4 exhibits the highest empathy but not the highest accuracy.
  • Coverage-Fatigue Tension: Pursuing deeper inquiry (↑coverage) increases the number of questions and risk of patient fatigue.

Patient-agent realism strongly impacts model performance. Incrementally adding to patient realism—atomic disclosure, linguistic variation, and noise injection—causes a monotonic decrease in diagnostic accuracy and AIU coverage, while increasing session length and token consumption. Notably, noise injection raises both patient-perceived empathy and clarity but lowers task performance:

| Simulation Setting | Acc | Cov | #Q | #Tok (k) | Clar | Emp |
|---|---|---|---|---|---|---|
| Basic | 0.576 | 0.513 | 8.77 | 0.18 | 0.754 | 0.415 |
| +Disclosure Control | 0.568 | 0.438 | 8.85 | 0.19 | 0.746 | 0.414 |
| +Linguistic Variation | 0.520 | 0.397 | 9.30 | 0.20 | 0.771 | 0.434 |
| +Noise Injection | 0.514 | 0.395 | 9.25 | 0.31 | 0.767 | 0.717 |

Stepwise analysis confirms that each increment toward behavioral realism—especially noise—increases conversational “cost” while eroding diagnostic accuracy and coverage.

6. Practical Implications and Future Directions

MAQuE demonstrates that SOTA LLM-based doctor agents remain highly sensitive to simulated patient realism. Inquiry efficiency, depth versus fatigue, and the lack of correlation between empathy and clinical accuracy remain open engineering and scientific challenges.

The results indicate that effective deployment of clinical AI agents requires multi-objective reward design attuned to both process (e.g., dialogue quality) and outcome (e.g., diagnosis), as well as strategies for curriculum learning with diverse, realistic simulators. Robust inquiry protocols must explicitly anticipate and manage trade-offs between accuracy, efficiency, and patient experience (Gong et al., 29 Sep 2025).

A plausible implication is that future agent architectures will need reward structures or policy optimizations that explicitly integrate patient experience, inquiry efficiency, and robustness to behavioral variance. MAQuE's fine-grained and reproducible benchmarking provides a uniquely comprehensive platform for assessing such developments.
