- The paper introduces SMB-Structure, a novel framework that models patient EHRs as dynamical systems to capture longitudinal treatment trajectories.
- It combines supervised fine-tuning over clinical tokens with JEPA-based latent prediction and momentum encoding to forecast future patient states.
- Empirical results demonstrate enhanced AUC-ROC performance on long-horizon tasks, validating its potential for robust clinical risk stratification.
Modeling Patients as Dynamical Systems: SMB-Structure Paradigm for Longitudinal EHR Embedding
Motivation and Context
The dominant paradigm in clinical foundation modeling has historically relied on next-token prediction over clinical text and structured codes, implicitly treating patient records as static documents to be summarized. While such autoregressive LLMs demonstrate strong downstream performance on discriminative clinical tasks via linear probing, their objective and resulting representations are agnostic to the temporal and intervention-driven dynamics intrinsic to disease progression. This distinction is critical: clinical decision-making requires forecasting the evolving patient state—not simply regurgitating distributional regularities in documentation.
SMB-Structure reframes this problem by introducing a world model approach for EHR modeling, inspired by successful applications of Joint-Embedding Predictive Architectures (JEPA) in vision and language domains. Here, the central claim is that explicit trajectory modeling—forcing encoders to anticipate latent patient states along the timeline—is necessary for robust forecasting under high patient heterogeneity, surpassing the limitations of token-level reconstruction.
Model Architecture
SMB-Structure integrates two complementary objectives: supervised fine-tuning (SFT) over clinical tokens, ensuring semantic grounding, and JEPA-based latent prediction, which compels the encoder to simulate future patient states in representation space.
Unlike a purely autoregressive objective, SMB-Structure's JEPA objective masks a fraction of the continuation (future-state) tokens and requires the model to predict their embeddings in latent space without direct access to them, so that trajectory dynamics are encoded at each step.
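The masking-and-prediction step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, embeddings are plain Python lists, and mean squared error stands in for the distance between predicted and target embeddings, which this summary does not specify.

```python
import random

def sample_continuation_mask(num_tokens, mask_ratio, rng):
    """Choose which continuation positions are hidden from the online encoder."""
    num_masked = max(1, round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]

def latent_prediction_loss(predicted, target, mask):
    """Mean squared error between predicted and target embeddings,
    averaged over masked positions only (assumed distance, not from the paper)."""
    total, count = 0.0, 0
    for pred_vec, tgt_vec, is_masked in zip(predicted, target, mask):
        if not is_masked:
            continue  # visible positions do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec)) / len(tgt_vec)
        count += 1
    return total / count

rng = random.Random(0)
mask = sample_continuation_mask(num_tokens=8, mask_ratio=0.5, rng=rng)

# Toy 2-dimensional "target encoder" outputs for 8 continuation tokens.
target = [[float(i), float(i) + 1.0] for i in range(8)]
perfect = [row[:] for row in target]                   # predictor matches exactly
shifted = [[v + 1.0 for v in row] for row in target]   # predictor off by 1 everywhere

loss_perfect = latent_prediction_loss(perfect, target, mask)
loss_shifted = latent_prediction_loss(shifted, target, mask)
```

In a real model the predicted vectors would come from a predictor network applied to the context encoding, and the targets from a separate target encoder; the sketch only shows that the loss is computed in embedding space and only at masked positions.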
Evaluation Methodology
The evaluation protocol emphasizes trajectories. The Point-in-Time framework slices patient histories into decision nodes (e.g., therapy initiation, progression confirmation), providing context embeddings before critical clinical junctures. Frozen SMB-Structure representations are then evaluated with linear probes on 68 downstream tasks spanning progression, toxicity, treatment durability, response, and survival across two large-scale cohorts: MSK Oncology (>23,000 patients, >323,000 patient-years) and INSPECT Pulmonary Embolism (>19,000 patients).
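A rough sketch of the Point-in-Time slicing idea follows. The event codes, the `Event` type, and the function name are hypothetical, and the labeling rule (outcome observed within a fixed horizon after the decision node) is a simplifying assumption that ignores censoring.

```python
from dataclasses import dataclass

@dataclass
class Event:
    day: int    # days since the first record (hypothetical time unit)
    code: str   # clinical token (hypothetical vocabulary)

def point_in_time_examples(events, decision_codes, outcome_code, horizon_days):
    """For each decision node, return (decision_day, context_codes, label).

    The context holds only events strictly before the node, so nothing
    from the node itself or from the future leaks into the input.
    """
    events = sorted(events, key=lambda e: e.day)
    examples = []
    for node in events:
        if node.code not in decision_codes:
            continue
        context = [e.code for e in events if e.day < node.day]
        label = any(
            e.code == outcome_code and node.day < e.day <= node.day + horizon_days
            for e in events
        )
        examples.append((node.day, context, label))
    return examples

history = [
    Event(0, "DX_LUNG"),
    Event(10, "START_THERAPY"),
    Event(40, "LAB_CEA_HIGH"),
    Event(200, "PROGRESSION"),
    Event(210, "START_THERAPY"),
    Event(500, "DEATH"),
]
examples = point_in_time_examples(
    history, decision_codes={"START_THERAPY"}, outcome_code="DEATH", horizon_days=365
)
```

Each context would then be encoded by the frozen model, and a linear probe would be trained on the resulting embeddings and labels.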
Figure 2: Evaluation Framework of Foundation Models for Time-to-Event EHR.
Data splits, time windows, and leakage-prevention strategies are designed to ensure temporal validity, so that reported performance reflects genuine representation quality rather than task-specific fine-tuning or overfitting to leaked information.
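One common leakage-prevention measure, shown here only as an illustration, is to split at the patient level so that all decision nodes from one patient land in the same partition. The summary does not state that the authors split this way; the hashing scheme, function name, and 20% test fraction are assumptions.

```python
import hashlib

def assign_split(patient_id, test_percent=20):
    """Deterministically map a patient to 'train' or 'test' by hashing the ID."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_percent else "train"

# Toy decision-node examples: (patient_id, decision_day).
examples = [("p1", 10), ("p1", 210), ("p2", 5), ("p3", 40), ("p3", 90)]

splits = {"train": [], "test": []}
for patient_id, day in examples:
    splits[assign_split(patient_id)].append((patient_id, day))

train_patients = {pid for pid, _ in splits["train"]}
test_patients = {pid for pid, _ in splits["test"]}
```

Because the assignment depends only on the patient ID, adding new decision nodes for an existing patient never moves that patient across the split.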
Empirical Results
SMB-Structure consistently demonstrates enhanced performance in trajectory-sensitive tasks, particularly those demanding long-range extrapolation. Adding INSPECT to MSK (curriculum training) yields significant AUC-ROC improvements for hybrid JEPA-SFT objectives versus SFT-only, supporting the claim that trajectory diversity regularizes latent dynamics learning. Short-horizon tasks (e.g., 30-day readmission) see modest gains, while long-horizon tasks such as 365-day mortality register more pronounced improvements over autoregressive baselines.
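For readers less familiar with the metric, AUC-ROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The following is a small reference implementation of that definition, not the evaluation code used in the paper; the labels and scores are toy values.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_roc(labels, scores)   # 3 of the 4 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations; production evaluations typically use a rank-based implementation.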
Figure 3: Model performance by oncology indication on MSK cohort.
Notably, curriculum training—separating semantic grounding (SFT) from dynamical modeling (JEPA)—outperforms joint Hybrid optimization, highlighting objective interference: without semantic bootstrapping, the gradients from SFT and JEPA objectives conflict, preventing the encoder from capturing either surface regularities or meaningful dynamics.
Ablations find the best performance with a 2-layer predictor whose width equals the LLM hidden dimension, equal SFT and JEPA loss weights, and a 50% masking ratio, which suggests that effective latent trajectory modeling depends on a well-tuned information bottleneck.
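The reported settings can be collected into a small configuration sketch. The class, field, and function names are hypothetical, and the `sft_only_steps` switch is one assumed way to express a curriculum (SFT first, then the hybrid objective); the summary does not describe the actual schedule.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    predictor_layers: int = 2   # reported optimum; width = LLM hidden dimension
    mask_ratio: float = 0.5     # reported optimum
    sft_weight: float = 1.0     # equal SFT/JEPA weighting, reported optimum
    jepa_weight: float = 1.0
    sft_only_steps: int = 0     # assumption: > 0 gives a curriculum-like schedule

def combined_loss(sft_loss, jepa_loss, step, cfg):
    """Weighted sum of both objectives; JEPA is switched off during warm-up."""
    if step < cfg.sft_only_steps:
        return cfg.sft_weight * sft_loss
    return cfg.sft_weight * sft_loss + cfg.jepa_weight * jepa_loss

joint = TrainingConfig()                          # both objectives from step 0
curriculum = TrainingConfig(sft_only_steps=100)   # SFT alone for the first 100 steps

loss_joint_start = combined_loss(2.0, 0.5, step=0, cfg=joint)
loss_curriculum_start = combined_loss(2.0, 0.5, step=0, cfg=curriculum)
loss_curriculum_later = combined_loss(2.0, 0.5, step=100, cfg=curriculum)
```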
Figure 4: Model performance benchmark on Lung Cancer.
Theoretical and Practical Implications
This work argues that patients should be modeled as dynamical systems, not documents. By integrating world model strategies (JEPA) with clinical semantic grounding (SFT), SMB-Structure enables encoders to capture disease momentum: not only the patient's current state, but where that state is heading over time and under intervention.
Practically, SMB-Structure yields frozen embeddings with superior linear probe performance across diverse prediction tasks, robust under patient heterogeneity and disease complexity. The model facilitates accurate long-horizon forecasts, critical for proper clinical risk stratification, intervention planning, and resource allocation.
Theoretically, the paradigm demonstrates that latent space simulation—when grounded by clinical semantics—can encode physiological change patterns reusable across disease domains, a capability unattainable by pure autoregressive models.
Limitations and Future Directions
SMB-Structure is computationally intensive, requiring dual forward passes and momentum target maintenance. The study restricts evaluation to linear probes, not fully mapping the upper bounds of representation utility under more expressive decoders. Model generalization outside MSK and INSPECT populations remains uncertain.
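The momentum target mentioned above is usually maintained as an exponential moving average (EMA) of the online encoder's weights, which is why a second set of parameters and a second forward pass are needed. The sketch below shows the standard EMA update on a toy parameter list; the momentum value and function name are assumptions, not settings from the paper.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]

target = [0.0, 0.0]   # toy target-encoder parameters
online = [1.0, 2.0]   # toy online-encoder parameters (held fixed here)

for _ in range(3):
    target = ema_update(target, online, momentum=0.9)
# With a fixed online encoder, the gap to the online weights shrinks by a
# factor of 0.9 per step, so after 3 steps target is about 27.1% of the way there.
```

The target encoder receives no gradients; it only tracks the online encoder through this update, which adds memory for the extra weights but no backward pass.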
Next steps include conditioning JEPA objectives on interventions to support counterfactual reasoning, expanding evaluation to nonlinear probing, and integrating fairness auditing for deployment-readiness. Extending world models to patient simulation under varied treatment paths promises a robust foundation for AI-driven clinical trial design, policy evaluation, and truly actionable precision medicine.
Conclusion
SMB-Structure advances longitudinal EHR modeling by enforcing latent trajectory simulation on top of semantic grounding. This paradigm shift is necessary to encode the “direction and velocity” of patient health status for clinically meaningful downstream applications. Separation of SFT for language understanding from JEPA for dynamic abstraction yields robust, actionable representations, laying the groundwork for more causal, generalizable, and intervention-aware clinical AI systems.