EHRWorld-110K: Patient-Centric Clinical Trajectories
- EHRWorld-110K is a patient-centric, longitudinal clinical dataset derived from MIMIC-IV, featuring over 110K de-identified hospitalization episodes.
- The dataset captures detailed static and time-stamped clinical events including inquiries and interventions, enabling robust simulation and counterfactual analysis.
- Rigorous data cleaning, normalization, and statistical filtering ensure high-quality trajectories for advancing sequential modeling and world simulation research.
EHRWorld-110K is a large-scale, patient-centric longitudinal clinical dataset derived from the MIMIC-IV electronic health record (EHR) database, designed explicitly to facilitate research on temporal modeling, simulation, and counterfactual reasoning in the context of hospitalization episodes. It comprises over 110,000 de-identified hospitalization trajectories with richly structured, time-stamped clinical events and static annotations, tailored for the development and evaluation of medical world models under long-horizon, intervention-driven dynamics (Mu et al., 3 Feb 2026).
1. Cohort Definition and Dataset Construction
EHRWorld-110K construction proceeds in three explicit stages:
- Patient cohort selection targets all adult (age ≥ 18) inpatient stays within MIMIC-IV (2008–2019) containing at least one medication administration and at least one laboratory or vital-sign event. Exclusions are applied to:
  - Admissions with zero therapy events,
  - Episodes with total event count below the 10th percentile or above the 90th percentile (to suppress uninformative and outlier stays),
  - Administrative duplicates and incomplete records.
- Time span and observation window: each episode is represented as a trajectory from admission to discharge; episodes shorter than one day or longer than the 90th-percentile length of stay (approximately 100 days) are omitted. No maximal episode length is imposed beyond this data-driven filtering.
- Scale: The resultant dataset comprises 110,513 hospitalization episodes and millions of individual, timestamped events. Episodes are grouped by de-identified patient (subject_id), and each patient may contribute multiple visits.
Expressed mathematically, with $V_p$ as the number of visits for patient $p$ and $N$ the total number of episodes:
$$N = \sum_{p} V_p = 110{,}513,$$
and with $n_i$ the number of events in episode $i$, the total event count is:
$$N_{\text{events}} = \sum_{i=1}^{N} n_i.$$
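The selection rules above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the per-stay field names (`age`, `n_medications`, `n_observations`, `n_events`, `los_days`) are hypothetical, and it assumes that percentiles are computed over the eligible adult stays with the "inclusive" interpolation method.

```python
from statistics import quantiles


def percentile_bounds(values, low=10, high=90):
    """Return the (low, high) percentile cut points of `values`."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[low - 1], cuts[high - 1]


def filter_cohort(stays):
    """Apply the cohort rules to a list of stay dicts (hypothetical fields)."""
    # Adults with at least one medication and one lab or vital-sign event.
    eligible = [
        s for s in stays
        if s["age"] >= 18 and s["n_medications"] >= 1 and s["n_observations"] >= 1
    ]
    if len(eligible) < 2:  # quantiles() needs at least two data points
        return eligible
    # Event-count bounds: keep stays inside the 10th-90th percentile range.
    lo_n, hi_n = percentile_bounds([s["n_events"] for s in eligible])
    # LOS bounds: at least one day, at most the 90th-percentile LOS.
    _, hi_los = percentile_bounds([s["los_days"] for s in eligible])
    return [
        s for s in eligible
        if lo_n <= s["n_events"] <= hi_n and 1.0 <= s["los_days"] <= hi_los
    ]
```

Computing both percentile bounds on the same eligible set keeps the two filters independent of the order in which they are applied; a pipeline that filters sequentially would yield slightly different cut points.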
2. Structure and Content of Each Trajectory
Each of the 110,513 episodes is encoded as a JSON object containing both static and longitudinal information:
- Static attributes (episode-level context) include:
  - Demographics: age (in years), gender, allergy history.
  - Hierarchical diagnoses: one "Primary Diagnosis" and an ordered set of "Secondary Diagnoses," each annotated with free-text "Content" and an associated "Reason" (both extracted from discharge summaries).
- Time-indexed event sequence: from admission to discharge, a chronologically ordered list captures:
  - Inquiry events: laboratory tests, vital signs, physical examination findings. Each is recorded as
    {"code": <test_code>, "timestamp": τ_t, "value": <numeric_or_categorical>, "units": <units>}
  - Intervention events: medication administrations or procedures, formatted as
    {"code": <med_code_or_proc_code>, "timestamp": τ_t, "dose": <dosage>, "route": <route>}
  - All events possess explicit raw timestamps (τ_t), absolute or relative to admission, and are grouped by timestamp as (τ_t, inquiries, interventions).
The schematic JSON structure for an episode is:
{
  "static": {
    "age": ...,
    "gender": ...,
    "allergies": ...,
    "diagnoses": {
      "primary": {"content": ..., "reason": ...},
      "secondary": [{ "content": ..., "reason": ... }, ...]
    }
  },
  "events": [
    { "timestamp": τ_1, "inquiries": [ ... ], "interventions": [ ... ] },
    ...
  ]
}
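A short Python sketch shows how an episode with this layout might be loaded and flattened into a single chronological stream. The `Episode` class and both helper functions are hypothetical names introduced for illustration; the sketch assumes that timestamps within one episode are mutually comparable (for example, hours since admission or ISO-8601 strings).

```python
import json
from dataclasses import dataclass, field


@dataclass
class Episode:
    age: float
    gender: str
    allergies: object
    primary_diagnosis: dict
    secondary_diagnoses: list
    events: list = field(default_factory=list)


def parse_episode(raw: str) -> Episode:
    """Parse one episode JSON string laid out like the schematic structure."""
    doc = json.loads(raw)
    static = doc["static"]
    diagnoses = static["diagnoses"]
    # Sort time steps chronologically (assumes comparable timestamps).
    events = sorted(doc["events"], key=lambda step: step["timestamp"])
    return Episode(
        age=static["age"],
        gender=static["gender"],
        allergies=static["allergies"],
        primary_diagnosis=diagnoses["primary"],
        secondary_diagnoses=diagnoses["secondary"],
        events=events,
    )


def flatten_events(episode: Episode):
    """Yield (timestamp, kind, record) triples in chronological order."""
    for step in episode.events:
        for rec in step.get("inquiries", []):
            yield step["timestamp"], "inquiry", rec
        for rec in step.get("interventions", []):
            yield step["timestamp"], "intervention", rec
```

Within a time step, inquiries are emitted before interventions; this ordering is a choice of the sketch, not something the schema specifies.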
3. Annotation Workflow and Quality Control
Key steps in processing and annotation for EHRWorld-110K are as follows:
- Static information extraction: An LLM (Qwen3-235B-Instruct) parses discharge summaries using prompts that enforce a rigid JSON schema. Only diagnoses actively treated during the index hospitalization are extracted; coincidental or historical comorbidities are omitted.
- Event cleaning: Administrative duplicates and entries lacking essential fields are removed. Only log entries with valid execution timestamps and values (for inquiries or interventions) are retained.
- Normalization and missingness: Laboratory and vital sign measurements are preserved in native clinical units; no global normalization (e.g., z-scoring) is performed during data release. Episodes with missing allergies or demographic attributes are excluded.
- Statistical filtering: To promote stability in downstream modeling, stays with length of stay (LOS) or event count outside the empirical 10th–90th percentile range are removed.
- Patient-centric data split: The split is made at the subject_id level, ensuring that all visits for a given patient fall exclusively in either the training or the test partition. The held-out test set contains 579 episodes (≈ 0.5% of all data), spanning 1,043 distinct primary/secondary diagnostic profiles.
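Two of the steps above, event cleaning and the patient-level split, lend themselves to a compact Python sketch. It is illustrative only: the required-field tuples, the hash-based partition rule, the `salt` string, and all function names are assumptions, and the hash rule is a simple stand-in that does not reproduce the released 579-episode test set.

```python
import hashlib

# Assumed minimal required fields for each event type (hypothetical).
REQUIRED_INQUIRY = ("code", "timestamp", "value")
REQUIRED_INTERVENTION = ("code", "timestamp")


def clean_events(records, required):
    """Drop records missing required fields, then drop exact duplicates."""
    seen, kept = set(), []
    for rec in records:
        if any(rec.get(k) is None for k in required):
            continue  # missing an essential field
        key = tuple(sorted((k, str(v)) for k, v in rec.items()))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        kept.append(rec)
    return kept


def assign_partition(subject_id, test_fraction=0.005, salt="ehrworld"):
    """Deterministically map a subject_id to 'train' or 'test'."""
    digest = hashlib.sha256(f"{salt}:{subject_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_episodes(episodes, test_fraction=0.005):
    """Split (subject_id, episode) pairs so no patient spans both partitions."""
    parts = {"train": [], "test": []}
    for subject_id, episode in episodes:
        name = assign_partition(subject_id, test_fraction)
        parts[name].append((subject_id, episode))
    return parts
```

Because the partition depends only on subject_id, every visit of a patient lands in the same partition regardless of processing order, which is the property the patient-centric split is meant to guarantee.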
4. Statistical Properties and Distributions
Key empirical statistics characterize the EHRWorld-110K corpus:
- Age:
  - Age distribution ranges from 18 to 90+, peaking at 50–60 years.
- Length of Stay (LOS):
  - For episode $i$, LOS in days is $\mathrm{LOS}_i = \tau_i^{\text{discharge}} - \tau_i^{\text{admission}}$.
  - Max LOS $\approx 100$ days.
- Event Counts per Episode:
  - For episode $i$, $n_i$ denotes the total number of recorded events.
  - Distribution is right-skewed: some episodes exceed 1,000 events.
- Event Intensity (per-day event rate):
  - Defined as $\rho_i = n_i / \mathrm{LOS}_i$ (events/day); rates are lower in routine episodes and higher in intensive care episodes.
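The per-episode quantities above (length of stay, event count, and event intensity) can be computed with a small Python helper. The function name and the returned keys are hypothetical, and the sketch assumes that an episode's event count is the total number of individual inquiry and intervention records across all time steps.

```python
from datetime import datetime


def episode_stats(admit: datetime, discharge: datetime, events: list) -> dict:
    """Return LOS (days), event count, and events/day for one episode."""
    los_days = (discharge - admit).total_seconds() / 86400.0
    # Count individual inquiry and intervention records (an assumption).
    n_events = sum(
        len(step.get("inquiries", [])) + len(step.get("interventions", []))
        for step in events
    )
    # Guard against zero-length stays rather than dividing by zero.
    intensity = n_events / los_days if los_days > 0 else float("nan")
    return {"los_days": los_days, "n_events": n_events, "events_per_day": intensity}
```

For example, an episode lasting exactly two days with six recorded events has an intensity of three events per day.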
5. Intended Use, Scope, and Limitations
EHRWorld-110K is released solely for research into longitudinal patient-level modeling and simulation. The following guidelines and caveats apply:
- Privacy and compliance: All patient information is de-identified under the MIMIC-IV IRB protocol. The risk of re-identification is nominal; however, use requires compliance with the MIMIC data use agreement.
- Limitations:
  - Autoregressive modeling approaches may accumulate errors over extended trajectory rollouts.
  - The dataset omits unstructured clinical narrative text apart from discharge summaries (i.e., progress notes and clinician free-text are excluded).
  - No explicit outcome labels (e.g., mortality or readmission) are provided; EHRWorld-110K represents "pure" trajectory data, not a supervised prediction corpus.
- Biases: The underlying population reflects the demographic and care patterns of a US academic hospital system, potentially encoding and perpetuating those patterns in simulated outputs.
- Recommended applications: The dataset is suitable for:
  - Training and evaluating sequential models mapping (time, action) pairs to resulting observations,
  - Supporting counterfactual policy-learning research if users define their own reward functions,
  - Model development, benchmarking, and ablation studies in world modeling and simulation.
- It is not intended for prospective deployment in live clinical decision support without further validation.
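As one possible reading of the first recommended application, the Python sketch below turns an episode's event list into supervised examples that map a (time, action) pair to the observations that follow. The pairing rule is an assumption made for illustration, not the formulation used by the dataset authors: each time step's interventions are treated as the action, and the next time step's inquiries as the resulting observation.

```python
def build_transitions(events):
    """Build (query_time, actions, next_observations) training examples.

    `events` is a chronologically ordered list of time steps, each a dict
    with "timestamp", "inquiries", and "interventions" keys.
    """
    transitions = []
    for current, following in zip(events, events[1:]):
        actions = current.get("interventions", [])
        observations = following.get("inquiries", [])
        if not observations:
            continue  # nothing observed at the next step, so no target
        transitions.append((following["timestamp"], actions, observations))
    return transitions
```

Steps with an empty action list are kept on purpose, since the evolution of a patient's state without intervention is also informative for a world model. Users studying counterfactual policies would attach their own reward function to these transitions.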
6. Context and Implications for World Modeling
EHRWorld-110K was developed to address shortcomings in applying traditional LLMs to dynamic, temporally evolving clinical simulation. Evaluations demonstrate that sequence models trained directly on causally grounded, longitudinal observation/intervention data—such as that provided by EHRWorld-110K—exhibit greater stability, more consistent modeling of rare or clinically significant events, and superior simulation fidelity over long horizons compared to LLMs that rely solely on static medical knowledge (Mu et al., 3 Feb 2026). A plausible implication is that high-fidelity, patient-centric trajectory datasets are indispensable for robust, trustworthy clinical world modeling in sequential settings.