Clinical Prediction Models (CPMs)
- Clinical prediction models are algorithmic tools that integrate multiple predictors to estimate individual risk and guide patient management.
- They use adaptive strategies such as sequential sample size determination, dynamic updating, and robust imputation to ensure model reliability.
- Advanced methods incorporate human-AI co-design, ontology-aware deep learning, and multi-outcome modeling to enhance interpretability and generalizability.
Clinical prediction models (CPMs) are mathematical or algorithmic tools designed to estimate an individual’s risk of a particular clinical outcome, conditional on multiple predictors. CPMs play a pivotal role in stratifying patients, guiding clinical management, allocating resources, and informing shared decision making across a wide range of healthcare settings. Methods for CPM development, evaluation, and deployment must rigorously address challenges of sample size determination, model instability, updating in dynamic environments, handling of missing data, generalizability, and clinical utility to ensure robust and trustworthy predictions for individual patients.
1. Sequential Sample Size Determination, Instability, and Learning Curves
Traditional CPM development relies on a fixed a priori sample size calculated from assumptions about predictor-outcome relationships and event prevalence. This approach is vulnerable to misspecification, resulting in overfitting or insufficient precision. Modern best practices leverage an adaptive sequential sample-size strategy using learning curves and explicit stopping rules based on both population- and individual-level prediction stability (Legha et al., 18 Sep 2025).
Sequential Sample Size Workflow:
- Begin with an initial cohort of individuals; develop the CPM (e.g., penalized logistic regression).
- At each interim analysis (e.g., after each pre-specified batch of newly recruited patients), re-develop and bootstrap-validate the model.
- Monitor learning curves for population-level stability metrics (e.g., bias-corrected calibration slope, optimism in AUC) and individual-level stability metrics (mean prediction uncertainty-interval width, mean maximal difference between the point estimate and its 95% interval, and mean misclassification probability at a clinical threshold).
- Apply pre-specified stopping rules, such as:
  - bias-corrected calibration slope within a pre-specified tolerance of 1,
  - mean optimism in the c-statistic below a pre-specified threshold,
  - mean 95% uncertainty-interval width below a pre-specified maximum,
  - mean misclassification probability below a pre-specified maximum.
- Continue recruitment until all stopping rules are satisfied over a pre-specified number of consecutive interims.
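The workflow above can be sketched as a simple loop. This is an illustrative implementation, not the authors' software: the batch size, optimism threshold, and number of consecutive interims are placeholder parameters, and only one stopping criterion (bootstrap optimism in the c-statistic) is monitored for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_optimism(X, y, n_boot=50, C=1.0):
    """Harrell-style bootstrap estimate of AUC optimism for a penalized logistic model."""
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        Xb, yb = X[idx], y[idx]
        if len(np.unique(yb)) < 2:
            continue  # degenerate resample; skip
        mb = LogisticRegression(C=C, max_iter=1000).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, mb.predict_proba(Xb)[:, 1])   # performance in bootstrap sample
        auc_orig = roc_auc_score(y, mb.predict_proba(X)[:, 1])     # performance back in original sample
        optimism.append(auc_boot - auc_orig)
    return apparent, float(np.mean(optimism))

def sequential_sample_size(X_all, y_all, batch=200, max_optimism=0.02, consecutive=2):
    """Re-develop and validate the CPM at each interim; stop when the rule holds
    over the required number of consecutive interims."""
    met = 0
    for n in range(batch, len(y_all) + 1, batch):
        _, opt = bootstrap_optimism(X_all[:n], y_all[:n])
        met = met + 1 if opt < max_optimism else 0
        if met >= consecutive:
            return n  # prediction stability reached: stop recruitment
    return None  # stopping rule never met with the available data
```

In practice the same loop would track all pre-specified criteria (calibration slope, interval width, misclassification probability) and stop only when every rule holds.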
In an application to AKI risk prediction, the fixed-sample calculation suggested a smaller required sample size than the sequential evaluation ultimately supported: the population-level stopping criteria were satisfied only at a substantially larger sample size, and individual-level stability was reached later still (Legha et al., 18 Sep 2025).
This learning-curve–guided design explicitly characterizes the transition from unstable to robust prediction and directly supports planning for sample size, monitoring for overfitting, and documenting model readiness at fine (individual) and coarse (cohort) levels (Riley et al., 2022).
2. Model Updating and Dynamic Environments
CPM performance can degrade over time due to changes in disease prevalence, case mix, clinical management, or predictor-outcome relationships. Robust frameworks for dynamic model updating have been developed, distinguishing three major strategies (Tanner et al., 2023):
- Discrete Refitting: Retrain the model afresh on new data, allowing full parameter re-estimation and new features, at the risk of overfitting and loss of continuity when sample sizes are small.
- Recalibration: Adjust only intercept and/or slope, preserving discriminative ability but correcting average risk drift; effective when underlying predictor effects are stable.
- Bayesian Dynamic Updating: Treat model parameters as random, setting the posterior of one interval as the prior for the next (“knowledge-carryover”). This yields smooth transitions, leverages all past data, and supports updating with small samples and new predictors via weak priors—at the cost of requiring summary statistics or full data and greater computational expense.
Simulation and real EHR studies confirm Bayesian dynamic updating reliably tracks calibration and discrimination under rapid epidemiological shifts, accommodates novel features, and is less likely than discrete refitting to yield unstable predictions when data sparsity or new risk factors emerge. Optimal update frequency and method must reflect setting-specific clinical and operational constraints (Tanner et al., 2023).
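The Bayesian "knowledge-carryover" idea can be sketched with a Laplace approximation: fit a MAP logistic regression under a Gaussian prior, approximate the posterior as Gaussian, and carry that posterior forward as the prior for the next interval. This is a minimal illustration, not the method of Tanner et al.; the function name and the use of BFGS with a numerical gradient are choices made here for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def bayes_logistic_update(X, y, prior_mean, prior_prec):
    """MAP fit of logistic regression with a Gaussian prior; returns the Laplace
    posterior (mean, precision), to be used as the prior for the next interval."""
    X1 = np.column_stack([np.ones(len(y)), X])  # add intercept column

    def neg_log_post(beta):
        eta = X1 @ beta
        ll = y @ eta - np.logaddexp(0, eta).sum()  # Bernoulli log-likelihood
        d = beta - prior_mean
        return -ll + 0.5 * d @ prior_prec @ d      # Gaussian prior penalty

    beta = minimize(neg_log_post, prior_mean, method="BFGS").x
    p = 1 / (1 + np.exp(-(X1 @ beta)))
    # Laplace approximation: posterior precision = prior precision + Fisher information
    post_prec = prior_prec + X1.T @ (X1 * (p * (1 - p))[:, None])
    return beta, post_prec
```

Looping this function over data intervals, with each posterior fed back as the next prior, smoothly accumulates evidence across updates; a weak prior on any newly added predictor column lets new features enter without destabilizing the rest of the model.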
3. Approaches to Missing Data in CPM Development and Implementation
Missing predictor data is ubiquitous in CPM pipelines and introduces bias if not rigorously addressed. Compatibility—the congruence of missing-data handling strategies across development, validation, and deployment—governs whether a CPM achieves accurate, unbiased, and robust performance at the point of care (Tsvetanova et al., 9 Apr 2025, Mi et al., 2024, 2206.12295).
Key principles:
- When missing values are not allowed at deployment (complete predictor data are required at prediction time), develop and validate the CPM using multiple imputation with the outcome (Y) included in the imputation model.
- If missingness is allowed at deployment (e.g., handled via mean imputation, regression imputation, multiple imputation without the outcome (MI-no-Y), or pattern sub-models), the same method and imputation model used at development must be applied during both validation and at prediction time.
- Pattern sub-modeling (fitting a separate sub-CPM for each missing data pattern) is theoretically optimal under complex missingness, but demands large training sets to support each sub-model.
- Imputation models for deployment must not use the outcome; deterministic regression imputation is favored for clinical deployment for operational simplicity, interpretability, and compatibility with real-time use.
- Use of missing indicators may improve CPM performance under certain MAR or MNAR-X mechanisms, but is contraindicated when missingness depends on the outcome, due to potential overfitting (2206.12295).
- Complete-case analysis should be avoided except under MCAR with low missingness due to inefficiency and bias amplification.
Pragmatically, package and document the CPM and its imputation machinery together for deployment (Tsvetanova et al., 9 Apr 2025, Mi et al., 2024). Internal validation (bootstrap or cross-validation) must mirror the intended real-world handling of missing data.
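Packaging the CPM and its imputation machinery together is straightforward with a pipeline object fitted as a unit on development data and serialized whole. The sketch below uses deterministic mean imputation for brevity; regression imputation (e.g., sklearn's `IterativeImputer`) would slot into the same `"impute"` step. Critically, the imputer never sees the outcome, so the identical pipeline is valid at prediction time.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Imputer and model are fitted together on development data and deployed as one
# object, guaranteeing compatible missing-data handling at the point of care.
cpm = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),     # outcome-free, deterministic imputation
    ("model", LogisticRegression(max_iter=1000)),   # the prediction model itself
])
```

After `cpm.fit(X_dev, y_dev)`, the whole pipeline can be pickled for deployment, and bootstrap or cross-validation of `cpm` automatically mirrors the intended real-world handling of missing predictors.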
4. Generalizability and Transportability Across Populations
Transportability—the ability of a CPM to maintain appropriate calibration and discrimination when applied to new populations—remains a critical barrier to impact (Ploddi et al., 2024, Leeuwen et al., 2024). Approaches are categorized as:
- Data-driven:
- Data augmentation/internal-external cross-validation: pool or stratify source data to increase heterogeneity.
- Ensemble methods: combine site-specific CPMs, tuned via out-of-cohort validation.
- Density ratio weighting (importance weighting): reweight training data to emulate the target population distribution.
- Knowledge-driven/Causal:
- Explicit use of DAGs and the do-operator to identify “parent” or “S-admissible” variables whose mechanisms are invariant to population, removing or adjusting mutable factors.
- Graph surgery estimators and invariant set optimizations for shift-robustification.
Practical recommendations emphasize external validation on multiple cohorts, explicit reporting of calibration metrics, and rigorous evaluation of shift mechanisms (covariate, prior, concept), with possible synthesis of data- and knowledge-driven methods to maximize generalizability (Ploddi et al., 2024). Empirical-Bayes meta-analytic techniques demonstrate that for most CPMs, there is a ±0.1 irreducible uncertainty in AUC when deployed in a new population—even after numerous validations (Leeuwen et al., 2024).
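Of the data-driven approaches, density ratio weighting is simple to sketch: train a classifier to discriminate source from target covariates, convert its predicted probabilities to odds, and use those as importance weights when refitting the CPM on source data. This is an illustrative implementation with a hypothetical function name, not a specific method from the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_weights(X_source, X_target):
    """Estimate w(x) ~ p_target(x) / p_source(x) by discriminating source vs target,
    so that reweighted source data emulate the target covariate distribution."""
    X = np.vstack([X_source, X_target])
    s = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]  # label: 1 = target
    clf = LogisticRegression(max_iter=1000).fit(X, s)
    p = clf.predict_proba(X_source)[:, 1]
    w = (p / (1 - p)) * (len(X_source) / len(X_target))  # odds, corrected for sample sizes
    return w / w.mean()  # normalize to mean 1 for numerical stability
```

The returned weights can be passed as `sample_weight` when fitting the CPM on source data; a more flexible discriminator (e.g., gradient boosting) can replace the logistic classifier when covariate shift is nonlinear.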
5. Addressing Intercurrent Treatment: The Predictimand Framework
When patients may receive treatment after baseline, the risk targeted by a CPM—its “predictimand”—must be precisely defined (Geloven et al., 2020, Sperrin et al., 2017). Frameworks distinguish:
- Observed treatment policy: predicts outcomes under actual mixture of care seen in derivation.
- Composite outcomes: risk of the event or treatment initiation, treating early treatment as part of the event.
- Competing risks/“while untreated”: risk of event prior to treatment start.
- Hypothetical no-treatment risk: counterfactual event risk if treatment were never initiated.
Each requires distinct estimation strategies, assumptions, and identifiability conditions. Marginal structural models (MSMs) with stabilized inverse-probability-of-treatment weighting are recommended for estimating “never-treated” risks, correcting for time-varying confounding and treatment drop-in (Sperrin et al., 2017). Explicit reporting of the chosen estimand, the underlying causal assumptions, and appropriate calibration/discrimination metrics is mandated for clinical interpretability and actionable use (Geloven et al., 2020).
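A heavily simplified, single-decision-point version of the weighting idea can be sketched as follows. This is not the full time-varying MSM of Sperrin et al., which handles treatment drop-in over follow-up; here treatment is a one-time baseline decision, and the function name and model choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def never_treated_risk(X, treated, event):
    """Estimate hypothetical no-treatment event risk with stabilized
    inverse-probability-of-treatment weights (single decision point only)."""
    ps_model = LogisticRegression(max_iter=1000).fit(X, treated)
    ps = ps_model.predict_proba(X)[:, 1]                 # propensity P(treated | X)
    untreated = treated == 0
    sw = (1 - treated.mean()) / (1 - ps[untreated])      # stabilized weights for the untreated
    # A weighted outcome model among the untreated approximates the never-treated population.
    out = LogisticRegression(max_iter=1000).fit(
        X[untreated], event[untreated], sample_weight=sw
    )
    return out.predict_proba(X)[:, 1]
```

The weights up-weight untreated patients who, given their covariates, were likely to be treated, correcting the confounding-by-indication that would otherwise bias a naive model fitted only to the untreated subgroup.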
6. New Frontiers: Human-AI Co-design, Embeddings, and Multi-Outcome CPMs
Human-AI Co-design for Interpretability: The HACHI framework iteratively combines an AI agent’s LLM-powered concept mining from clinical text with expert feedback. Sparse CPMs using binary (yes/no) concepts are refined for interpretability, bias-detection, and site/time generalizability. In real-world TBI and AKI tasks, human-in-the-loop refinement surfaced novel predictors, mitigated data leakage, and maintained high discrimination (AUC up to 0.91) (Feng et al., 14 Jan 2026).
Ontology-Aware Deep Learning: Integrating knowledge-graph embeddings (e.g., Poincaré embeddings of the SNOMED hierarchies) into deep learning architectures (ResNet, Transformer) yields modest yet systematic gains in discrimination (ΔAUROC ~0.02) for lung cancer onset prediction in EHR—without sacrificing calibration. This approach elegantly combines data-driven learning with the hierarchical structure of clinical language, offering improved semantic generalizability (John et al., 20 Aug 2025).
Multi-Outcome Risk Prediction: For joint prediction of correlated binary outcomes (e.g., multimorbidity), independent univariate CPMs systematically underpredict joint risks and may be miscalibrated. Probabilistic classifier chains, multinomial logistic regression, and Bayesian multivariate probit frameworks are recommended to model residual correlation and ensure well-calibrated joint and marginal predictions, especially when the phi coefficient exceeds 0.1 (Martin et al., 2020).
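For two outcomes, a probabilistic classifier chain reduces to fitting P(y1 | x) and P(y2 | x, y1) and composing them. The sketch below is a minimal two-outcome illustration (the class name is invented here); chains over more outcomes iterate the same factorization in a chosen ordering.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class TwoOutcomeChain:
    """Probabilistic classifier chain for two correlated binary outcomes:
    P(y1, y2 | x) = P(y1 | x) * P(y2 | x, y1)."""

    def fit(self, X, y1, y2):
        self.m1 = LogisticRegression(max_iter=1000).fit(X, y1)
        self.m2 = LogisticRegression(max_iter=1000).fit(np.column_stack([X, y1]), y2)
        return self

    def joint_proba(self, X):
        p1 = self.m1.predict_proba(X)[:, 1]
        p2_given1 = self.m2.predict_proba(np.column_stack([X, np.ones(len(X))]))[:, 1]
        p2_given0 = self.m2.predict_proba(np.column_stack([X, np.zeros(len(X))]))[:, 1]
        # Columns: P(0,0), P(0,1), P(1,0), P(1,1) -- well-calibrated joint and marginal risks
        return np.column_stack([
            (1 - p1) * (1 - p2_given0),
            (1 - p1) * p2_given0,
            p1 * (1 - p2_given1),
            p1 * p2_given1,
        ])
```

Unlike independent univariate models, the chain lets the second model exploit residual correlation through its y1 input, so the implied joint risk P(1,1) is not forced to equal the product of the marginals.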
In contemporary clinical prediction modeling, methodological rigor at every stage—sample size determination, model updating, missing data handling, transportability, causal clarity regarding treatment, interpretability, and multidimensional outcome modeling—is essential to yield models that are trustworthy, generalizable, and fit for clinical decision making at the individual-patient level. This comprehensive workflow is supported by robust validation frameworks, learning-curve monitoring, and integration of domain expertise with state-of-the-art machine learning methodologies.