MedAgentSim: LLM-Driven Clinical Simulation

Updated 1 March 2026
  • MedAgentSim is a simulation framework for training, evaluating, and evolving LLM-driven agents through multi-turn clinical interactions using synthetic patient data.
  • It integrates modular patient, doctor, and measurement agents to simulate high-fidelity, privacy-preserving clinical scenarios with rigorous evaluation pipelines.
  • The platform leverages advanced techniques like self-evolution, chain-of-thought prompting, and precise metrics to benchmark diagnostic performance and system scalability.

MedAgentSim is a class of simulation environments and frameworks for training, evaluating, and evolving LLM-driven agents in healthcare, with a focus on high-fidelity, multi-agent, multi-turn clinical interactions. MedAgentSim systems combine modular patient simulators, doctor agents, measurement or environment agents, and rigorous, often fully-automated, evaluation pipelines. These platforms address the need for reproducible, scalable, and privacy-preserving clinical simulation, enabling the benchmarking and improvement of conversational and decision-making capabilities in LLM-based health agents using synthetic patients derived from real-world EHR or epidemiological data (Rashidian et al., 4 Jun 2025, Almansoori et al., 28 Mar 2025, Oh et al., 11 Dec 2025, Yu et al., 2024, Du et al., 2024, Li et al., 2024, Aghaee et al., 9 Feb 2026).

1. System Architecture and Core Agent Roles

MedAgentSim frameworks are architected as multi-agent environments, typically comprising at least three principal agent types: Patient Agent, Doctor (Clinician) Agent, and Measurement or Environment Agent. Modern implementations incorporate further modularity, adding auxiliary agents for summarization, retrieval, quality control, or environment simulation (Almansoori et al., 28 Mar 2025, Aghaee et al., 9 Feb 2026, Yu et al., 2024).

  • Patient Agent: Instantiated from de-identified EHR vignettes, knowledge graphs, or fused epidemiological/claims data. Internal state includes demographics, symptom timeline, comorbidities, and optionally personality traits (e.g., Big Five profile). Responses are generated using a base LLM guided by a role-constrained prompt, enforcing only-inquired detail disclosure, temporal/event consistency, and paraphrasing to prevent memorized leakage (Rashidian et al., 4 Jun 2025, Yu et al., 2024, Aghaee et al., 9 Feb 2026).
  • Doctor Agent: Operates with no or minimal initial knowledge, synthesizing clinical questions, ordering tests, and formulating diagnoses or triage recommendations. Architectures often use an LLM controller (“Director”) that delegates tasks to specialized sub-agents (e.g., SymptomCollector, HealthDataPlanner) (Rashidian et al., 4 Jun 2025, Oh et al., 11 Dec 2025, Almansoori et al., 28 Mar 2025).
  • Measurement Agent/Environment Agent: Models measurement requests (labs, imaging, vital signs) and environmental variables (psychosocial context, population-level risks). Implemented as lookup tables or procedural generative systems. Some frameworks incorporate further contextualization via EnvironmentAgent or DataAgent for longitudinal or population-informed state updates (Almansoori et al., 28 Mar 2025, Aghaee et al., 9 Feb 2026).
  • Orchestration: All agent dialogue and state transitions are coordinated through a structured protocol. Agents interact over shared memory, with explicit turn-taking and message schemas. Dialogue history, prior measurements, and current context are serialized for LLM input as structured prompts or context windows.

A summary of canonical agent roles and modalities:

| Agent Type | Input State | Action/Output |
|---|---|---|
| PatientAgent | EHR/extracted vignette, internal traits | Natural-language responses, symptom detail |
| DoctorAgent | Conversation history, measurements | Symptom queries, test orders, diagnosis |
| Measurement/EnvAgent | True patient/environmental state | Test results, population variable updates |
| Auxiliary Agents | Profile data, dialogue transcripts | Summaries, corrections, QA signals |
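
The orchestration described above (shared history, explicit turn-taking, structured message schemas serialized into LLM context) can be sketched as a minimal turn loop. The `Message`/`Episode` schema and the canned `patient_agent` are illustrative assumptions for this sketch, not the published implementation; a real system would back each agent with an LLM call.

```python
from dataclasses import dataclass, field

# Hypothetical message schema: every exchange is a typed, ordered record,
# so the full history can be serialized into the next LLM prompt.
@dataclass
class Message:
    sender: str   # "doctor", "patient", or "measurement"
    kind: str     # "question", "answer", "test_order", "test_result", "diagnosis"
    content: str

@dataclass
class Episode:
    history: list = field(default_factory=list)

    def transcript(self) -> str:
        # Serialize dialogue history for context injection.
        return "\n".join(f"[{m.sender}/{m.kind}] {m.content}" for m in self.history)

def run_turn(episode: Episode, doctor_move: Message, responder) -> None:
    """One orchestrated turn: the doctor acts, the addressed agent replies."""
    episode.history.append(doctor_move)
    if doctor_move.kind != "diagnosis":  # a diagnosis is terminal, no reply
        episode.history.append(responder(doctor_move))

# Stand-in patient agent (a real system would prompt an LLM in role).
def patient_agent(msg: Message) -> Message:
    return Message("patient", "answer", "I have had a fever for three days.")

ep = Episode()
run_turn(ep, Message("doctor", "question", "What brings you in today?"), patient_agent)
print(ep.transcript())
```

The same loop extends naturally to a Measurement Agent: a `test_order` move is routed to a results lookup instead of the patient.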

2. Data Sources, Patient Simulation, and State Representation

Patient simulation relies on high-fidelity data extraction, preprocessing, and abstraction pipelines:

  • EHR-based vignette extraction: De-identified, HIPAA-compliant records are ingested, PHI is scrubbed with rule-based and LLM-assisted redaction, and clinical notes are segmented and normalized. Named-entity recognition (NER) extracts structured attributes (onset, symptom, comorbidity, lab values), which are composed into vignettes or knowledge graph nodes (Rashidian et al., 4 Jun 2025, Yu et al., 2024).
  • Knowledge graph construction: Entities and relations (e.g., Patient, Symptom, HAS_DURATION, Admission) are mapped into a KG (e.g., Neo4j), supporting structured retrieval and subgraph selection via agentic workflows (Yu et al., 2024).
  • Patient state modeling: Internal state $s_t$ includes symptom slots, demographic features, personality vectors, prior medical history, and (in advanced multi-agent systems) environmental/contextual encodings. State transitions are either deterministic or stochastic, parameterized by physiological and psychosocial parameter vectors $\theta$ (Aghaee et al., 9 Feb 2026).
  • Fidelity & diversity: Systems such as SynthAgent employ multi-source fusion from NHANES, claims, epidemiological surveys, and PubMed narratives to construct patient cohorts whose statistical/semantic fidelity is measured via LLM-as-judge scoring, embedding-based diversity, or KL divergence between real and synthetic feature marginals (Aghaee et al., 9 Feb 2026).
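
The KL-divergence fidelity check mentioned above compares real and synthetic feature marginals. A minimal sketch on a categorical feature follows; the smoothing constant and toy cohorts are assumptions for illustration.

```python
import math
from collections import Counter

def kl_divergence(p: dict, q: dict, eps: float = 1e-9) -> float:
    """D_KL(P || Q) over a shared categorical support, with light smoothing
    for categories absent from the synthetic cohort."""
    support = set(p) | set(q)
    return sum(
        p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
        for k in support if p.get(k, 0.0) > 0
    )

def marginal(values: list) -> dict:
    """Empirical marginal distribution of a categorical feature."""
    counts = Counter(values)
    n = len(values)
    return {k: c / n for k, c in counts.items()}

# Toy feature marginals: smoking status in a real vs. synthetic cohort.
real = marginal(["never", "never", "former", "current", "never", "former"])
synthetic = marginal(["never", "former", "never", "current", "never", "current"])
score = kl_divergence(real, synthetic)
print(f"KL(real || synthetic) = {score:.4f}")
```

A score near zero indicates the synthetic marginal closely matches the real one; in practice this is computed per feature across the cohort.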

3. Multi-Agent Interaction Protocols and Dialogue Generation

Interaction proceeds as a multi-turn conversation where the Doctor Agent iteratively queries the Patient Agent and Measurement Agent, gathering data and synthesizing diagnostic reasoning:

  • Turn structure: At each turn, the doctor chooses to ask a question, request a measurement, or issue a diagnostic decision. The Measurement Agent responds only to explicit requests, returning ground-truth test results. Conversation history is maintained as an ordered sequence for context injection (Almansoori et al., 28 Mar 2025, Oh et al., 11 Dec 2025).
  • Slot filling and context encoding: Structured slots (e.g., fever: yes/no) are updated per turn, included in LLM prompts to maintain dialogue coherence and prevent contradictions (Rashidian et al., 4 Jun 2025).
  • Dialogue constraints: The Patient Agent's prompt enforces role-appropriate language and disclosure only on inquiry, and prohibits revealing the diagnosis or volunteering extra-clinical information. For simulated standardized patients (SPs), additional requirements include personality traits, avoidance of medical terminology, and behavioral nuances such as hesitation or refusals (Du et al., 2024, Aghaee et al., 9 Feb 2026).
  • Major enhancement mechanisms: Retrieval-Augmented Generation (RAG), chain-of-thought (CoT) prompting, ensemble majority-voting, and few-shot learning with dynamically curated exemplars or principles are prominent strategies for improving dialog quality and diagnostic accuracy (Rashidian et al., 4 Jun 2025, Almansoori et al., 28 Mar 2025, Li et al., 2024, Du et al., 2024).
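
The slot-filling step above can be sketched as a merge of per-turn extractions into a persistent state that is rendered into the next prompt. The first-answer-wins conflict policy and the helper names are illustrative assumptions, not the published design.

```python
# Hypothetical slot-filling step: structured slots extracted from each patient
# reply are merged into the dialogue state, then rendered into the next prompt
# so the doctor agent cannot contradict earlier findings.
def update_slots(slots: dict, extracted: dict) -> dict:
    merged = dict(slots)
    for key, value in extracted.items():
        # First answer wins: re-asking a filled slot must not silently overwrite it.
        merged.setdefault(key, value)
    return merged

def render_context(slots: dict) -> str:
    """Serialize the slot state into a prompt fragment."""
    filled = ", ".join(f"{k}={v}" for k, v in sorted(slots.items()))
    return f"Known findings so far: {filled or 'none'}."

state = {}
state = update_slots(state, {"fever": "yes"})
state = update_slots(state, {"cough": "no", "fever": "no"})  # conflicting re-answer ignored
print(render_context(state))
# → Known findings so far: cough=no, fever=yes.
```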

4. Self-Evolution, Coevolution, and Experience Replay

A defining feature of modern MedAgentSim systems is agent evolution through simulated experience accumulation rather than supervised updates:

  • Self-evolution algorithms: After each diagnostic episode, Doctor Agent performance is assessed. Correctly solved cases and their dialogue traces are stored in a record library; failed cases invoke self-reflective principle extraction, and the extracted principles are validated and stored in an experience base if they improve future performance (Li et al., 2024, Almansoori et al., 28 Mar 2025).
  • Coevolution: In EvoPatient-like frameworks, both patient and doctor agents are iteratively improved via stored trajectory and attention libraries. New simulations both exploit (for efficient prompting) and explore (adding novel high-quality Q-A patterns) these dynamic libraries. On each iteration, cases are validated for faithfulness and robustness, and only alignment-compliant trajectories are retained (Du et al., 2024).
  • Memory and retrieval: Experience replay is realized via embedding-based KNN retrieval (CLIP, text-embedding-ada-002, Jina Embeddings) over accumulated dialogue and principle libraries. Prompting for new episodes composes context from most similar historical exemplars and validated heuristics (Almansoori et al., 28 Mar 2025, Li et al., 2024).
  • Human-in-the-loop (optional): While these frameworks are fully automated by design, human feedback may periodically sample episodes for expert review, ensuring continued clinical alignment (Rashidian et al., 4 Jun 2025, Oh et al., 11 Dec 2025).
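
At its core, the embedding-based KNN retrieval described above is a cosine-similarity top-k lookup over the stored library. In this sketch, random vectors stand in for a real encoder (such as those named above), and library rows stand in for embedded dialogue traces.

```python
import numpy as np

def knn_retrieve(query: np.ndarray, library: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k stored cases most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                     # cosine similarity to every stored case
    return np.argsort(-sims)[:k]       # indices, most similar first

rng = np.random.default_rng(0)
library = rng.normal(size=(5, 8))              # 5 stored trace embeddings (stand-ins)
query = library[3] + 0.01 * rng.normal(size=8)  # a new case close to stored case 3
nearest = knn_retrieve(query, library, k=2)
print(nearest)
```

The retrieved indices select the historical exemplars and validated heuristics that are spliced into the prompt for the next episode.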

5. Evaluation Frameworks, Metrics, and Benchmarks

MedAgentSim platforms provide rigorous, often automated, evaluation suites for both simulated interaction quality and real-world generalization:

  • Core metrics: Consistency Rate (CR), Case-Summary Relevance (precision, recall, $F_1$), Alignment Score (AS), and inter-rater agreement (Cohen's $\kappa$) for simulation-vs-clinician concordance (Rashidian et al., 4 Jun 2025).
  • Multi-dimensional metrics: AutoMedic/MedAgentSim introduces CARE, combining Conversational Accuracy ($S_{ACC}$), Empathy ($S_{EMP}$), Robustness ($S_{ROB}$), and Conversational Efficiency & Strategy ($S_{CES}$). Explicit equations quantify each component, such as:
    • $S_{ACC} = \mathrm{Acc}_{Conv} \times (\mathrm{Acc}_{Conv} / \mathrm{Acc}_{QA})$
    • $S_{CES} = (100/N) \sum_{i=1}^{N} s_{CES}^{i}$ (Oh et al., 11 Dec 2025).
  • Knowledge graph extraction validity: Span-level $P$, $R$, and $F_1$ against expert annotations; reported $F_1 = 0.89$ for GPT-4-Turbo NER (Yu et al., 2024).
  • Empirical benchmarks: MedAgentSim has been evaluated on NEJM, MedQA, and MIMIC-IV, with ablations showing stepwise accuracy gains from incorporating measurement agents, memory retrieval, CoT prompting, and ensembling, reaching up to 70.8% accuracy (MedQA, LLaMA-3.3-70B) (Almansoori et al., 28 Mar 2025).
  • Expert validation: Evaluation protocols involve double-blind scoring by clinicians across 14 dimensions; scoring stability is ensured with high inter-rater $\kappa$ values (0.79 and above) (Rashidian et al., 4 Jun 2025).
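
The two CARE component equations stated above compute directly; the numeric inputs below are illustrative, not reported results.

```python
# S_ACC penalizes conversational accuracy that lags static-QA accuracy;
# S_CES averages per-episode efficiency/strategy scores onto a 0-100 scale.
def s_acc(acc_conv: float, acc_qa: float) -> float:
    """S_ACC = Acc_Conv * (Acc_Conv / Acc_QA)."""
    return acc_conv * (acc_conv / acc_qa)

def s_ces(per_episode_scores: list) -> float:
    """S_CES = (100 / N) * sum_i s_CES^i."""
    n = len(per_episode_scores)
    return (100.0 / n) * sum(per_episode_scores)

print(round(s_acc(0.60, 0.80), 3))   # accuracy lag shrinks the score: 0.45
print(round(s_ces([0.9, 0.7, 0.8]), 1))
```

Note how $S_{ACC}$ rewards a model whose multi-turn (conversational) accuracy approaches its single-shot QA accuracy: at `acc_conv == acc_qa` the score equals the raw accuracy.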

6. Scalability, Specialty Adaptation, and Privacy

  • Scale and throughput: Parallelized scenario execution enables batch simulation (e.g., ≈300 simulations/hour over 1,000 concurrent dialogues on multi-GPU clusters) (Rashidian et al., 4 Jun 2025). Modular design, data agent pipelines, and efficient embedding-based retrieval support scaling to thousands of virtual patients (Aghaee et al., 9 Feb 2026).
  • Specialty and population modules: Pediatric and mental health adaptations are realized through prompt modifications and specialty-specific safety rules. Auxiliary screening instruments (e.g., PHQ-9) and guardian response modules are incorporated for domain coverage (Rashidian et al., 4 Jun 2025).
  • Synthetic data and privacy: All patient trajectories are synthetic without re-identification risk; strict QA workflows enforce schema compliance and privacy preservation (Aghaee et al., 9 Feb 2026). Data fusion leverages NHANES, BRFSS, claims, and literature, with differential privacy extensions possible in future iterations.
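
Because each simulated encounter is independent, the batch parallelism above reduces to a worker pool over episodes. In this sketch `simulate_case` is a stand-in for a full LLM-backed multi-turn encounter, and the worker count is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_case(case_id: int) -> dict:
    """Stand-in for one complete patient-doctor episode; a real driver
    would run the multi-turn loop against LLM-backed agents here."""
    return {"case": case_id, "diagnosis": f"dx-{case_id}", "turns": 5}

def run_batch(case_ids, max_workers: int = 8) -> list:
    """Run many independent episodes concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate_case, case_ids))

results = run_batch(range(100))
print(len(results), results[0]["diagnosis"])
```

Threads suffice here because episode time is dominated by network-bound LLM calls; CPU-bound simulators would use a process pool instead.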

7. Limitations and Prospective Directions

Current MedAgentSim frameworks acknowledge several open challenges:

  • Nonverbal behavior simulation: Absence of tone, prosody, and facial cues; LLM-driven SPs may only approximate human nuance (Du et al., 2024).
  • Chain-of-thought and multi-step reasoning: While CoT prompting and self-reflection improve faithfulness and robustness, nuanced errors (e.g., “cheat” queries) and context misinterpretations persist (Du et al., 2024, Almansoori et al., 28 Mar 2025).
  • Library hygiene in self-evolving systems: Filtering low-quality or misaligned dialogue trajectories remains imperfect, despite correction mechanisms (Du et al., 2024).
  • Evaluation generalizability: Automated evaluation remains partially dependent on LLM and expert scoring. Statistical significance tests, confidence intervals, and ablation across diverse datasets are standard best practices (Rashidian et al., 4 Jun 2025, Oh et al., 11 Dec 2025).

A plausible implication is that as simulation fidelity, multimodal integration, and agentic reasoning mature, MedAgentSim will underpin not only model development and validation but also medical education, system integration, and hypothesis-driven health systems simulation (Rashidian et al., 4 Jun 2025, Aghaee et al., 9 Feb 2026, Yu et al., 2024).
