The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR

Published 29 Jan 2026 in cs.AI, cs.CE, and q-bio.QM | (2601.22128v1)

Abstract: LLMs trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.


Summary

The paper introduces SMB-Structure, a novel framework that models patient EHRs as dynamical systems to capture longitudinal treatment trajectories.
It combines supervised fine-tuning over clinical tokens with JEPA-based latent prediction and momentum encoding to forecast future patient states.
Empirical results demonstrate enhanced AUC-ROC performance on long-horizon tasks, validating its potential for robust clinical risk stratification.

Modeling Patients as Dynamical Systems: SMB-Structure Paradigm for Longitudinal EHR Embedding

Motivation and Context

The dominant paradigm in clinical foundation modeling has historically relied on next-token prediction over clinical text and structured codes, implicitly treating patient records as static documents to be summarized. While such autoregressive LLMs demonstrate strong downstream performance on discriminative clinical tasks via linear probing, their objective and resulting representations are agnostic to the temporal and intervention-driven dynamics intrinsic to disease progression. This distinction is critical: clinical decision-making requires forecasting the evolving patient state—not simply regurgitating distributional regularities in documentation.

SMB-Structure reframes this problem by introducing a world model approach for EHR modeling, inspired by successful applications of Joint-Embedding Predictive Architectures (JEPA) in vision and language domains. Here, the central claim is that explicit trajectory modeling—forcing encoders to anticipate latent patient states along the timeline—is necessary for robust forecasting under high patient heterogeneity, surpassing the limitations of token-level reconstruction.

Model Architecture

SMB-Structure integrates two complementary objectives: supervised fine-tuning (SFT) over clinical tokens, ensuring semantic grounding, and JEPA-based latent prediction, which compels the encoder to simulate future patient states in representation space. This configuration comprises:

Clinical Tokenization: Fine-grained delimiter tokens are added to segment demographic, condition, measurement, procedure, medication, and note fields within EHR sequences, improving backbone structure-awareness.
Predictor: A bottleneck MLP refines encoder representations, tasked with predicting masked future state embeddings.
Momentum Encoder: An EMA copy of the encoder, supplying stable targets for JEPA alignment and preventing representation collapse.
Figure 1: Architecture for SMB-Structure for Time-to-Event EHR Modeling.
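The clinical tokenization step can be illustrated with a minimal sketch. The delimiter strings (`<demographic>`, `</demographic>`, and so on), the field order, the helper name `serialize_record`, and the sample values are all hypothetical; the summary does not give the actual token vocabulary.

```python
# Hypothetical field names and delimiter format; the real vocabulary is not specified here.
FIELDS = ["demographic", "condition", "measurement", "procedure", "medication", "note"]


def serialize_record(record: dict[str, list[str]]) -> str:
    """Wrap each non-empty field in open/close delimiter tokens, in a fixed field order."""
    parts = []
    for field in FIELDS:
        values = record.get(field, [])
        if not values:
            continue  # skip fields with no entries for this patient
        parts.append(f"<{field}>" + "; ".join(values) + f"</{field}>")
    return "".join(parts)


# Made-up example values, for illustration only.
example = {
    "demographic": ["age 64", "female"],
    "medication": ["carboplatin"],
}
print(serialize_record(example))
```

Explicit segment boundaries of this kind let the backbone tell a medication entry from a measurement without inferring it from surface text.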

Unlike simple autoregressive models, SMB-Structure's JEPA objective masks a fraction of the continuation (future state) tokens and requires the model to predict their embeddings in latent space without direct access to them, so that trajectory dynamics are encoded at each step.
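The two mechanisms described above, an EMA target encoder and a latent loss computed only at masked positions, can be sketched in plain Python. This is a toy illustration under stated assumptions: the cosine distance, the decay value of 0.99, and the function names are choices made here, not details confirmed by the source.

```python
import math


def ema_update(target, online, decay=0.99):
    """Move target weights toward online weights: t <- decay * t + (1 - decay) * o."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def jepa_loss(predicted, targets, masked_positions):
    """Mean distance between predicted and target embeddings at masked positions only.

    `targets` stands in for the momentum encoder's outputs, which are treated
    as constants (no gradient flows through them).
    """
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    losses = [cosine_distance(predicted[i], targets[i]) for i in masked_positions]
    return sum(losses) / len(losses)
```

Because the targets come from a slowly moving copy of the encoder rather than the encoder itself, the predictor cannot trivially satisfy the loss by mapping every input to the same vector.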

Evaluation Methodology

The evaluation protocol emphasizes trajectories. The Point-in-Time framework slices patient histories into decision nodes (e.g., therapy initiation, progression confirmation), providing context embeddings before critical clinical junctures. Frozen SMB-Structure representations are then linearly probed for 68 downstream tasks spanning progression, toxicity, treatment durability, response, and survival across two large-scale cohorts: MSK Oncology (>23,000 patients, >323,000 patient-years) and INSPECT Pulmonary Embolism (>19,000 patients).
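Linear probing of frozen embeddings can be sketched as follows. This is a toy, dependency-free illustration, assuming a logistic-regression probe trained by batch gradient descent and AUC-ROC computed by pairwise ranking; the function names, hyperparameters, and toy embeddings are hypothetical, and a real evaluation would score a held-out split rather than the training data.

```python
import math


def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression on fixed embeddings X; the encoder itself is never updated."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - t  # predicted probability minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_scores(X, w, b):
    """Linear scores (logits) for each embedding."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]


# Toy "frozen embeddings" and binary outcomes, for illustration only.
X = [[0.0, 1.0], [0.2, 0.9], [0.9, 0.1], [1.0, 0.0]]
y = [0, 0, 1, 1]
w, b = fit_linear_probe(X, y)
print(auc_roc(y, probe_scores(X, w, b)))
```

Keeping the probe linear means any performance difference between models is attributable to the embeddings rather than to the capacity of the downstream classifier.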

Figure 2: Evaluation Framework of Foundation Models for Time-to-Event EHR.

Data splits, time windows, and leakage-prevention strategies ensure temporal validity and that performance reflects genuine representation quality rather than task-specific finetuning or exploitative overfitting.
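One way to picture the leakage-prevention idea is a point-in-time slice: the context may contain only events strictly before the decision date, and the label is read from a fixed window after it. The sketch below is an assumption about how such slicing could work; the function name, the event codes, the strict-inequality cutoff, and the sample dates are all hypothetical and not taken from the paper.

```python
from datetime import date, timedelta


def slice_at_decision_point(events, index_date, outcome_code, horizon_days):
    """Split a patient timeline at a decision date.

    events: list of (date, code) tuples.
    Returns (context, label), where context holds events strictly before
    index_date and label is 1 if outcome_code occurs within horizon_days
    on or after index_date.
    """
    context = [(d, code) for d, code in events if d < index_date]
    window_end = index_date + timedelta(days=horizon_days)
    label = int(any(
        code == outcome_code and index_date <= d <= window_end
        for d, code in events
    ))
    return context, label


# Made-up timeline, for illustration only.
events = [
    (date(2020, 1, 5), "diagnosis"),
    (date(2020, 2, 1), "therapy_start"),
    (date(2020, 6, 10), "progression"),
]
context, label = slice_at_decision_point(events, date(2020, 2, 1), "progression", 180)
print(context, label)
```

Because the context is cut before the label window opens, nothing the probe sees can encode the outcome it is asked to predict.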

Empirical Results

SMB-Structure consistently demonstrates enhanced performance in trajectory-sensitive tasks, particularly those demanding long-range extrapolation. Adding INSPECT to MSK (curriculum training) yields significant AUC-ROC improvements for hybrid JEPA-SFT objectives versus SFT-only, substantiating the claim that trajectory diversity regularizes latent dynamics learning. Short-horizon tasks (e.g., 30-day readmission) see modest gains, while long-horizon tasks such as 365-day mortality register more pronounced improvements over autoregressive baselines.

Figure 3: Model performance by oncology indication on MSK cohort.

Notably, curriculum training—separating semantic grounding (SFT) from dynamical modeling (JEPA)—outperforms joint Hybrid optimization, highlighting objective interference: without semantic bootstrapping, the gradients from SFT and JEPA objectives conflict, preventing the encoder from capturing either surface regularities or meaningful dynamics.

Ablations reveal performance optima for predictor complexity (2-layer, width = LLM hidden dimension), equal SFT-JEPA loss weighting, and a $50\%$ masking ratio—validating the requirement of a well-tuned information bottleneck for effective latent trajectory modeling.
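As a hedged formalization of the loss weighting and masking settings above (the notation is an assumption introduced here, not taken from the paper): let $f_\theta$ be the encoder, $f_\xi$ its EMA copy with decay $\tau$, $g_\phi$ the predictor, $M$ the set of masked continuation positions, $d$ an unspecified distance in embedding space, and $\mathrm{sg}[\cdot]$ a stop-gradient.

$$
\mathcal{L} = \lambda_{\mathrm{SFT}}\,\mathcal{L}_{\mathrm{SFT}} + \lambda_{\mathrm{JEPA}}\,\mathcal{L}_{\mathrm{JEPA}},
\qquad
\mathcal{L}_{\mathrm{JEPA}} = \frac{1}{|M|} \sum_{i \in M} d\big(g_\phi(f_\theta(x_{\mathrm{ctx}}))_i,\ \mathrm{sg}[f_\xi(x)_i]\big),
\qquad
\xi \leftarrow \tau\,\xi + (1 - \tau)\,\theta
$$

In this notation, equal weighting corresponds to $\lambda_{\mathrm{SFT}} = \lambda_{\mathrm{JEPA}}$, and a $50\%$ masking ratio means $M$ covers roughly half of the continuation tokens.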

Figure 4: Model performance benchmark on Lung Cancer.

Theoretical and Practical Implications

This work establishes the necessity to model patients as dynamical systems, not documents. By integrating world model strategies (JEPA) with clinical semantic grounding (SFT), SMB-Structure enables encoders to capture disease momentum: not only what the patient is, but where they are going under intervention and time.

Practically, SMB-Structure yields frozen embeddings with superior linear probe performance across diverse prediction tasks, robust under patient heterogeneity and disease complexity. The model facilitates accurate long-horizon forecasts, critical for proper clinical risk stratification, intervention planning, and resource allocation.

Theoretically, the paradigm demonstrates that latent space simulation—when grounded by clinical semantics—can encode physiological change patterns reusable across disease domains, a capability unattainable by pure autoregressive models.

Limitations and Future Directions

SMB-Structure is computationally intensive, requiring dual forward passes and momentum target maintenance. The study restricts evaluation to linear probes, not fully mapping the upper bounds of representation utility under more expressive decoders. Model generalization outside MSK and INSPECT populations remains uncertain.

Next steps include conditioning JEPA objectives on interventions to support counterfactual reasoning, expanding evaluation to nonlinear probing, and integrating fairness auditing for deployment-readiness. Extending world models to patient simulation under varied treatment paths promises a robust foundation for AI-driven clinical trial design, policy evaluation, and truly actionable precision medicine.

Conclusion

SMB-Structure advances longitudinal EHR modeling by enforcing latent trajectory simulation on top of semantic grounding. This paradigm shift is necessary to encode the “direction and velocity” of patient health status for clinically meaningful downstream applications. Separation of SFT for language understanding from JEPA for dynamic abstraction yields robust, actionable representations, laying the groundwork for more causal, generalizable, and intervention-aware clinical AI systems.
