
EHRWorld-110K: Patient-Centric Clinical Trajectories

Updated 10 February 2026
  • EHRWorld-110K is a patient-centric, longitudinal clinical dataset derived from MIMIC-IV, featuring over 110K de-identified hospitalization episodes.
  • The dataset captures detailed static and time-stamped clinical events including inquiries and interventions, enabling robust simulation and counterfactual analysis.
  • Rigorous data cleaning, normalization, and statistical filtering ensure high-quality trajectories for advancing sequential modeling and world simulation research.

EHRWorld-110K is a large-scale, patient-centric longitudinal clinical dataset derived from the MIMIC-IV electronic health record (EHR) database, designed explicitly to facilitate research on temporal modeling, simulation, and counterfactual reasoning in the context of hospitalization episodes. It comprises over 110,000 de-identified hospitalization trajectories with richly structured, time-stamped clinical events and static annotations, tailored for the development and evaluation of medical world models under long-horizon, intervention-driven dynamics (Mu et al., 3 Feb 2026).

1. Cohort Definition and Dataset Construction

EHRWorld-110K construction proceeds in three explicit stages:

  • Patient cohort selection targets all adult (age ≥ 18) inpatient stays within MIMIC-IV (2008–2019) containing at least one medication administration and at least one laboratory or vital-sign event. Exclusions are applied to:
    • Admissions with zero therapy events,
    • Episodes with total event count below the 10th percentile or above the 90th percentile (to suppress uninformative and outlier stays),
    • Administrative duplicates and incomplete records.
  • Time span and observation window: each episode is represented as a trajectory from admission ($\tau_1$) to discharge ($\tau_T$), omitting episodes under one day or above the 90th-percentile length of stay (approximately 100 days). There is no imposed maximal episode length beyond data-driven filtering.
  • Scale: The resultant dataset comprises $N_{visits} = 110{,}513$ hospitalization episodes and approximately $N_{events} = 17.5$ million individual, timestamped events. Episodes correspond to de-identified patients (grouped by subject_id), each potentially contributing multiple visits.

Expressed mathematically, with $T_i$ as the number of visits for patient $i$:

$$N_{visits} = \sum_{i=1}^{N_{pat}} T_i = 110{,}513$$

and with $E_j$ the number of events in episode $j$:

$$N_{events} = \sum_{j=1}^{110{,}513} E_j \approx 17.5 \times 10^6$$
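The exclusion rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Stay` record and its field names are hypothetical, the nearest-rank percentile is one of several possible definitions, and computing the thresholds on the already-eligible stays is a choice the source does not specify.

```python
from dataclasses import dataclass
import math


@dataclass
class Stay:
    # Hypothetical per-episode summary; field names are illustrative only.
    hadm_id: int
    age: int
    n_events: int
    n_therapy_events: int


def percentile(values, q):
    """Nearest-rank percentile (integer q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def filter_cohort(stays):
    """Keep adult stays with therapy events and event counts within P10..P90."""
    eligible = [s for s in stays if s.age >= 18 and s.n_therapy_events > 0]
    if not eligible:
        return []
    counts = [s.n_events for s in eligible]
    # Assumption: thresholds are computed on the eligible stays only.
    lo, hi = percentile(counts, 10), percentile(counts, 90)
    return [s for s in eligible if lo <= s.n_events <= hi]
```

The same pattern would apply to the length-of-stay filter by swapping `n_events` for a duration field.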

2. Structure and Content of Each Trajectory

Each of the 110,513 episodes is encoded as a JSON object containing both static and longitudinal information:

  • Static attributes (episode-level context) include:
    • Demographics: age (in years), gender, allergy history.
    • Hierarchical diagnoses: one "Primary Diagnosis" and an ordered set of "Secondary Diagnoses," each annotated with free-text "Content" and an associated "Reason" (both extracted from discharge summaries).
  • Time-indexed event sequence: for $t = 1, \ldots, T$ (admission to discharge), a chronologically ordered list captures:
    • Inquiry events ($A_{inq}$): laboratory tests, vital signs, physical examination findings. Each is recorded as

      {"code": <test_code>, "timestamp": τ_t, "value": <numeric_or_categorical>, "units": <units>}
    • Intervention events ($A_{int}$): medication administrations or procedures, formatted as

      {"code": <med_code_or_proc_code>, "timestamp": τ_t, "dose": <dosage>, "route": <route>}
    • All events possess explicit raw timestamps ($\tau_t$), absolute or relative to admission, and are grouped as $A_t = A_{inq} \cup A_{int}$.

The schematic JSON structure for an episode is:

{
  "static": {
    "age": ...,
    "gender": ...,
    "allergies": ...,
    "diagnoses": {
      "primary": {"content": ..., "reason": ...},
      "secondary": [{ "content": ..., "reason": ... }, ...]
    }
  },
  "events": [
    { "timestamp": τ_1, "inquiries": [ ... ], "interventions": [ ... ] },
    ...
  ]
}
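To make the schema concrete, the sketch below loads a toy episode in this shape and flattens it into a single chronological timeline. The function name `flatten_events`, the example codes and values, and the use of hours since admission as the timestamp unit are all assumptions for illustration; the released files may differ in detail.

```python
import json


def flatten_events(episode):
    """Return (timestamp, kind, record) tuples in chronological order."""
    rows = []
    for step in episode["events"]:
        ts = step["timestamp"]
        for rec in step.get("inquiries", []):
            rows.append((ts, "inquiry", rec))
        for rec in step.get("interventions", []):
            rows.append((ts, "intervention", rec))
    # Stable sort keeps the within-step order of records.
    rows.sort(key=lambda row: row[0])
    return rows


# Toy episode; every code and value here is made up for illustration.
EXAMPLE = json.loads("""
{
  "static": {"age": 63, "gender": "F", "allergies": [],
             "diagnoses": {"primary": {"content": "example diagnosis", "reason": "example reason"},
                           "secondary": []}},
  "events": [
    {"timestamp": 0.0,
     "inquiries": [{"code": "lab_x", "timestamp": 0.0, "value": 2.1, "units": "mmol/L"}],
     "interventions": []},
    {"timestamp": 1.5,
     "inquiries": [],
     "interventions": [{"code": "med_y", "timestamp": 1.5, "dose": "1 g", "route": "IV"}]}
  ]
}
""")

timeline = flatten_events(EXAMPLE)
```

Keeping the record dictionaries intact, rather than unpacking them into columns, lets inquiry and intervention events share one list even though their fields differ.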

3. Annotation Workflow and Quality Control

Key steps in processing and annotation for EHRWorld-110K are as follows:

  • Static information extraction: An LLM (Qwen3-235B-Instruct) parses discharge summaries using prompts that enforce a rigid JSON schema. Only diagnoses actively treated during the indexed hospitalization are extracted; coincidental or historical comorbidities are omitted.
  • Event cleaning: Administrative duplicates and entries lacking essential fields are removed. Only log entries with valid execution timestamps and values (for inquiries or interventions) are retained.
  • Normalization and missingness: Laboratory and vital sign measurements are preserved in native clinical units; no global normalization (e.g., z-scoring) is performed during data release. Episodes with missing allergies or demographic attributes are excluded.
  • Statistical filtering: To promote stability in downstream modeling, stays with length of stay (LOS) or event count outside the empirical 10th–90th percentile range are removed.
  • Patient-centric data split: The split is made at the subject_id level, ensuring all visits for a given patient fall exclusively in either the training or the test partition. The held-out test set contains 579 episodes ($\approx 0.5\%$ of all data), spanning 1,043 distinct primary/secondary diagnostic profiles.
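The patient-centric split can be illustrated with a short sketch. It assumes each episode is a dictionary carrying a `subject_id` key and uses a plain seeded shuffle; the function name and parameters are hypothetical, and the sketch does not reproduce any stratification over diagnostic profiles that the original split may have applied.

```python
import random


def split_by_patient(episodes, test_fraction=0.005, seed=0):
    """Assign whole patients to train or test so no subject_id spans both."""
    subject_ids = sorted({ep["subject_id"] for ep in episodes})
    rng = random.Random(seed)
    rng.shuffle(subject_ids)
    # At least one patient is held out, even for tiny inputs.
    n_test = max(1, round(test_fraction * len(subject_ids)))
    test_ids = set(subject_ids[:n_test])
    train = [ep for ep in episodes if ep["subject_id"] not in test_ids]
    test = [ep for ep in episodes if ep["subject_id"] in test_ids]
    return train, test
```

Splitting on patients rather than episodes prevents a model from seeing one admission of a patient during training and being evaluated on another admission of the same patient.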

4. Statistical Properties and Distributions

Key empirical statistics characterize the EHRWorld-110K corpus:

  • Age:
    • $\mu_{age} = \frac{1}{N_{pat}} \sum_{i=1}^{N_{pat}} age_i \approx 58.0~\text{years}$
    • $\sigma_{age} \approx 19.2~\text{years}$
    • Age distribution ranges from 18 to 90+, peaking at 50–60 years.
  • Length of Stay (LOS):
    • For episode $j$, LOS in days is $\ell_j$.
    • $\mu_{LOS} = \frac{1}{N_{visits}} \sum_{j=1}^{N_{visits}} \ell_j \approx 5.2~\text{days}$
    • $\sigma_{LOS} \approx 4.8~\text{days}$
    • Max LOS $> 100$ days.
  • Event Counts per Episode:
    • For episode $j$, $e_j = |\mathbf{E}_j|$.
    • $\mu_e = \frac{1}{N_{visits}} \sum_{j=1}^{N_{visits}} e_j \approx 158.5$
    • $\sigma_e \approx 115.3$
    • Distribution is right-skewed: some episodes exceed 1,000 events.
  • Event Intensity (per-day event rate):
    • Defined as $\frac{e_j}{\ell_j}$, averaging $\approx 30$ events/day, with a range from $<10$ to $>200$ (lower in routine, higher in intensive care episodes).

5. Intended Use, Scope, and Limitations

EHRWorld-110K is released solely for research into longitudinal patient-level modeling and simulation. The following guidelines and caveats apply:

  • Privacy and compliance: All patient information is de-identified under the MIMIC-IV IRB protocol. The risk of re-identification is nominal; however, use requires compliance with the MIMIC data use agreement.
  • Limitations:
    • Autoregressive modeling approaches may accumulate errors over extended trajectory rollouts.
    • The dataset omits unstructured clinical narrative text apart from discharge summaries (i.e., progress notes and clinician free-text are excluded).
    • No explicit outcome labels (e.g., mortality or readmission) are provided; EHRWorld-110K represents "pure" trajectory data, not a supervised prediction corpus.
  • Biases: The underlying population reflects the demographic and care patterns of a US academic hospital system, potentially encoding and perpetuating those patterns in simulated outputs.
  • Recommended applications: The dataset is suitable for:
    • Training and evaluating sequential models mapping (time, action) pairs to resulting observations,
    • Supporting counterfactual policy-learning research if users define their own reward functions,
    • Model development, benchmarking, and ablation studies in world modeling and simulation.
  • Out-of-scope use: The dataset is not intended for prospective deployment in live clinical decision support without further validation.
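One way to frame the first recommended application, mapping (time, action) pairs to resulting observations, is to turn each episode into supervised triples. The sketch below is an assumption about how such samples might be built, not the formulation used by the dataset authors: it treats each inquiry's `(timestamp, code)` as the query, its recorded value as the target, and everything seen earlier as context. The function name `make_samples` and the tuple layout are hypothetical.

```python
def make_samples(episode):
    """Yield (context, query, target) triples from one episode dict."""
    history = []
    for step in episode["events"]:
        ts = step["timestamp"]
        # Assumption: interventions at a step are applied before its inquiries.
        for rec in step.get("interventions", []):
            history.append(("intervention", ts, rec["code"]))
        for rec in step.get("inquiries", []):
            context = list(history)      # snapshot of everything seen so far
            query = (ts, rec["code"])    # the (time, action) pair
            yield context, query, rec["value"]
            history.append(("inquiry", ts, rec["code"], rec["value"]))


# Toy episode with made-up codes and values.
toy = {
    "events": [
        {"timestamp": 0.0,
         "inquiries": [{"code": "lab_x", "value": 2.1}],
         "interventions": []},
        {"timestamp": 1.5,
         "inquiries": [{"code": "lab_x", "value": 1.4}],
         "interventions": [{"code": "med_y"}]},
    ]
}
samples = list(make_samples(toy))
```

During long rollouts a model's own predictions would replace the recorded values in `history`, which is where the error accumulation noted under Limitations can arise.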

6. Context and Implications for World Modeling

EHRWorld-110K was developed to address shortcomings in applying traditional LLMs to dynamic, temporally evolving clinical simulation. Evaluations indicate that sequence models trained directly on causally grounded, longitudinal observation/intervention data, such as that provided by EHRWorld-110K, exhibit greater stability, more consistent modeling of rare or clinically significant events, and superior simulation fidelity over long horizons compared to LLMs that rely solely on static medical knowledge (Mu et al., 3 Feb 2026). A plausible implication is that high-fidelity, patient-centric trajectory datasets are indispensable for robust, trustworthy clinical world modeling in sequential settings.
