Landmarking Supermodels via Gradient Boosted Trees
- The paper’s main contribution is the integration of landmark supermodels with gradient boosted trees for dynamic survival prediction through a unified nonparametric framework.
- This approach relaxes parametric and Markovian assumptions by pooling data from multiple landmark points to estimate a conditional hazard surface via Poisson regression.
- Empirical evaluations show that the boosted-tree supermodel outperforms traditional Cox models in nonlinear and high-dimensional settings while maintaining temporal coherence and scalability.
Landmarking supermodels combined with gradient boosted trees provide a nonparametric methodology for dynamic prediction in event history analysis with high-dimensional, time-dependent covariates. This approach directly estimates the future conditional hazard surface via a unified model, relaxing parametric and Markovian assumptions. The estimator leverages gradient boosting on regression trees to capture nonlinearities and high-order interactions; the statistical procedure is formalized as a sieve M-estimation problem and is computationally reducible to Poisson regression. This framework is distinguished by its theoretical consistency, adaptability to high-dimensional covariate structures, and avoidance of the temporal incoherence common in traditional landmark Cox model approaches (Sandqvist, 24 Jan 2026).
1. Landmarking Supermodels: Definition and Structure
Given $n$ independent subjects observed over an interval $[0, \tau]$, each subject $i$ is characterized by:
- $N_i(t)$: counting process for the event of interest,
- $Y_i(t)$: at-risk indicator,
- $X_i(t)$: a high-dimensional, time-dependent covariate process.
For a fixed landmark time $s \in [0, \tau_0]$, the future conditional hazard at time $t \in (s, \tau]$, given covariate history up to $s$, is
$$h(t \mid s, w) = \lim_{\Delta \downarrow 0} \frac{1}{\Delta} \, P\big( N(t + \Delta) - N(t) = 1 \,\big|\, Y(t) = 1,\; W(s) = w \big),$$
where $W(s) = \{X(u) : 0 \le u \le s\}$ denotes the covariate history up to $s$.
Traditional landmark analyses fit separate models at a grid of fixed landmark times $s_1 < \cdots < s_m$. The landmarking supermodel instead fits a single surface $f(t, s, w)$ to approximate $\log h(t \mid s, w)$, pooling all landmarked data jointly for improved statistical efficiency and coherence.
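As an illustration of the pooled ("stacked") data construction, the following stdlib-only Python sketch draws random landmarks per subject and keeps, for each, only the covariate history observed up to that landmark. The data layout (`visits`, `time`, `event`) and the last-observation-carried-forward history summary are hypothetical choices for illustration, not the paper's exact construction.

```python
import random

def stack_landmarks(subjects, num_landmarks, tau0, seed=0):
    """Pool subjects across random landmarks into one stacked dataset."""
    rng = random.Random(seed)
    rows = []
    for subj in subjects:
        for _ in range(num_landmarks):
            s = rng.uniform(0.0, tau0)       # landmark ~ Uniform[0, tau0]
            if subj["time"] <= s:            # subject no longer at risk at s
                continue
            # covariate history up to s, summarized here by its last value (LOCF)
            hist = [x for (u, x) in subj["visits"] if u <= s]
            if not hist:
                continue
            rows.append({"id": subj["id"], "landmark": s, "cov": hist[-1],
                         "time": subj["time"], "event": subj["event"]})
    return rows

# Two toy subjects: (visit time, covariate value) pairs plus follow-up outcome.
subjects = [
    {"id": 1, "time": 5.0, "event": 1, "visits": [(0.0, 1.2), (2.0, 1.8)]},
    {"id": 2, "time": 3.0, "event": 0, "visits": [(0.0, 0.7)]},
]
stacked = stack_landmarks(subjects, num_landmarks=3, tau0=4.0)
```

Each stacked row pairs a landmark with the history available at that landmark and the subject's subsequent outcome; a single model is then fit to all rows jointly.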
2. Mathematical Formulation and Sieve M-Estimation
The empirical criterion for a candidate log-hazard surface $f$, at a landmark $s$, for subject $i$ is
$$\ell_i(f; s) = \int_s^{\tau} f(t, s, W_i(s)) \, dN_i(t) - \int_s^{\tau} Y_i(t) \, e^{f(t, s, W_i(s))} \, dt.$$
Random landmarks $S_{i1}, \dots, S_{iL}$ are introduced, independent and uniformly distributed on $[0, \tau_0]$. The per-subject empirical functional, and the pooled criterion, are
$$m_i(f) = \frac{1}{L} \sum_{l=1}^{L} \ell_i(f; S_{il}), \qquad M_n(f) = \frac{1}{n} \sum_{i=1}^{n} m_i(f).$$
The estimation problem is intractable over the class of all bounded measurable functions, so a sequence of sieves $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ is defined; in practice, these are linear spans of regression trees. The estimator is given by
$$\hat{f}_n \in \operatorname*{arg\,max}_{f \in \mathcal{F}_n} M_n(f),$$
up to a small optimization error.
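To make the criterion concrete, here is a minimal discrete-time evaluation of the per-subject term $\int_s^\tau f \, dN_i - \int_s^\tau Y_i \, e^{f} \, dt$ for an arbitrary candidate log-hazard surface `f`. Representing a subject by a list of event times plus a censoring time, and the grid step `dt`, are illustrative assumptions of this sketch.

```python
import math

def criterion(f, s, w, event_times, censor_time, tau, dt=0.01):
    """Approximate l_i(f; s) = ∫ f dN - ∫ Y exp(f) dt on a time grid."""
    end = min(censor_time, tau)
    ll = 0.0
    # counting-process integral: f evaluated at observed event times after s
    for t in event_times:
        if s < t <= end:
            ll += f(t, s, w)
    # compensator integral: cumulative exp(f) while at risk, left-endpoint rule
    n_steps = int(round((end - s) / dt))
    for k in range(n_steps):
        ll -= math.exp(f(s + k * dt, s, w)) * dt
    return ll
```

For a constant hazard $e^f = \lambda$ with one event and follow-up of length one, this returns approximately $\log \lambda - \lambda$, as expected from the formula.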
3. Reduction to Poisson Regression and Integration of Gradient Boosted Trees
On each sieve $\mathcal{F}_n$, $f$ is piecewise constant on a partition $\{A_1, \dots, A_{K_n}\}$ of the $(t, s, w)$-domain, i.e., $f = \beta_k$ on cell $A_k$. For landmark $s$, define cell-level event counts and exposures
$$O_{ik} = \int_s^{\tau} \mathbf{1}\{(t, s, W_i(s)) \in A_k\} \, dN_i(t), \qquad E_{ik} = \int_s^{\tau} Y_i(t) \, \mathbf{1}\{(t, s, W_i(s)) \in A_k\} \, dt.$$
Then,
$$\ell_i(f; s) = \sum_{k} \big( O_{ik} \, \beta_k - E_{ik} \, e^{\beta_k} \big),$$
which is, up to an additive constant, the Poisson log-likelihood of counts $O_{ik}$ with means $E_{ik} e^{\beta_k}$, i.e., with exposures $E_{ik}$. Consequently, the optimization can be performed using standard Poisson-loss gradient boosting, fitting trees to pseudo-residuals at each boosting step. Model training thus accommodates high-dimensional and nonlinear feature effects naturally.
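The reduction can be checked directly: with per-cell event counts $O_k$ and exposures $E_k$ (notation illustrative), each log-hazard level $\beta_k$ enters the log-likelihood only through $O_k \beta_k - E_k e^{\beta_k}$, whose maximizer is the log occurrence/exposure rate $\log(O_k / E_k)$. A small stdlib-only sketch:

```python
import math

def poisson_loglik(beta, counts, exposures):
    """Sum over cells of O_k * beta_k - E_k * exp(beta_k)."""
    return sum(o * b - e * math.exp(b)
               for o, e, b in zip(counts, exposures, beta))

def mle_log_rates(counts, exposures):
    """Per-cell maximizer: the log occurrence/exposure rate log(O_k / E_k)."""
    return [math.log(o / e) for o, e in zip(counts, exposures)]

counts, exposures = [3.0, 1.0], [10.0, 4.0]
beta_hat = mle_log_rates(counts, exposures)   # [log 0.3, log 0.25]
```

Gradient boosting replaces this closed-form, cell-by-cell maximization with a sequence of shallow trees, so the cells need not be fixed in advance.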
4. Implementation, Computational Details, and Tuning
The approach is implemented as follows:
- Data construction: For each subject and simulated landmark, compute $O_{ik}$ and $E_{ik}$ for all partition cells, yielding a "long" dataset with offsets $\log E_{ik}$.
- Boosting algorithm:
- Initialize $\hat f_0$ to the constant overall log event rate.
- For $m = 1, \dots, M$: compute negative gradients of the Poisson loss, fit regression trees (of depth $d$) to these residuals, update $\hat f_m = \hat f_{m-1} + \nu \cdot \text{tree}_m$, and truncate to $[-C, C]$.
- Selection of learning rate ($\nu$), number of trees ($M$), and tree depth ($d$) is by cross-validation on Poisson loss.
- Scalability: The stacked dataset has on the order of (subjects × landmarks × partition cells) rows; per-iteration boosting cost is roughly linear in the number of rows and features with histogram-based splitting, with an additional sorting overhead for exact splits. Merging cells sharing coefficients reduces memory. Tree-based methods maintain efficiency as feature dimensionality grows, due to internal variable selection mechanisms.
- Software compatibility: Methodology admits direct implementation in off-the-shelf gradient boosting libraries (e.g., XGBoost, LightGBM) with Poisson objective and exposure (offset).
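The training loop above can be mimicked end-to-end with a toy, stdlib-only Poisson gradient booster on depth-1 stumps over a single feature. This is a pedagogical sketch of the mechanism (Poisson loss with exposure offsets), not the paper's implementation; a real fit would use a library objective as noted above.

```python
import math

def fit_stump(x, residuals):
    """Least-squares depth-1 regression tree: one threshold, two leaf means."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:                     # constant feature: one global leaf
        m = sum(residuals) / len(residuals)
        return (float("inf"), m, m)
    _, thr, lm, rm = best
    return (thr, lm, rm)

def boost_poisson(x, counts, exposures, n_trees=100, lr=0.1):
    """Minimize sum_k E_k*exp(f_k) - O_k*f_k by boosting stumps."""
    f0 = math.log(sum(counts) / sum(exposures))   # overall log event rate
    f = [f0] * len(x)
    trees = []
    for _ in range(n_trees):
        # pseudo-residuals = negative gradient: O_k - E_k * exp(f_k)
        resid = [o - e * math.exp(fi)
                 for o, e, fi in zip(counts, exposures, f)]
        thr, lm, rm = fit_stump(x, resid)
        trees.append((thr, lm, rm))
        f = [fi + lr * (lm if xi <= thr else rm) for fi, xi in zip(f, x)]
    return f0, trees

def predict(f0, trees, xi, lr=0.1):
    """Fitted log hazard rate at feature value xi."""
    return f0 + lr * sum(lm if xi <= thr else rm for thr, lm, rm in trees)
```

At scale, the same fit is obtained by passing the long dataset to an off-the-shelf booster, e.g. XGBoost with `objective="count:poisson"` and the log exposures supplied as the base margin.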
5. Theoretical Guarantees and Temporal Coherence
Consistency of the estimator relies on two central properties:
- Boundedness: the sieve functions are uniformly bounded, $\|f\|_\infty \le C$ on $\mathcal{F}_n$.
- Approximate maximization: the fitted $\hat f_n$ nearly maximizes the empirical criterion over $\mathcal{F}_n$, up to an $o_P(1)$ error.
Under these conditions and as the sieve partitions become increasingly fine, the estimator asymptotically recovers the true log-hazard surface $f_0$ in $L_2$-norm: $\|\hat f_n - f_0\|_{L_2} \to 0$ in probability. Temporal inconsistencies prevalent in landmark Cox models (where sequences of landmark-specific regression coefficients may not jointly define a proper conditional hazard without further assumptions) are avoided. The supermodel approach ensures internal coherence, since the estimator is a unified function of $(t, s, w)$ and tree-based sieves are dense in the relevant function space. No additional Markov, homogeneity, or consistency conditions (e.g., Jewell–Nielsen) are necessary (Sandqvist, 24 Jan 2026).
6. Empirical Evaluation and Applied Results
Simulation studies examine three principal settings:
- Scenario 1 (linear Markovian, 3 covariates): For small sample sizes, the naive Cox method was optimal; at large sample sizes, the Cox and boosted-tree supermodels performed equivalently.
- Scenario 2 (nonlinear, non-Markovian): The boosted-tree supermodel yielded superior performance at moderate and large sample sizes; the Cox supermodel was biased.
- Scenario 3 (high-dimensional noise covariates): The boosted-tree supermodel resisted high variance, screening out 47 noise covariates, while the Cox supermodel displayed instability.
Across all settings, increasing the number of landmarks yielded progressive improvements in RMSE of survival predictions.
The primary biliary cirrhosis (PBC) case study utilized data from 312 patients with irregularly timed visits and uniformly distributed random landmarks. Key predictors, bilirubin and albumin, were identified via Shapley values and partial-dependence plots. The generated individual dynamic survival curves matched established clinical prognostic models. Computational time was approximately one minute on a standard laptop using R and XGBoost.
7. Summary and Significance
The landmarking supermodel with gradient boosted trees extends classical landmark analysis to a fully nonparametric regime, enabling dynamic, high-dimensional survival prediction without restrictive model assumptions. The method:
- Simultaneously leverages data from multiple landmark points via a single (t,s,w)-indexed supermodel,
- Frames estimation as a sieve M-estimator of a Poisson-regression likelihood,
- Employs gradient boosted trees for expressive, interpretable function approximation,
- Achieves weak $L_2$-consistency of the fitted log-hazard,
- Resolves temporal incoherence that affects multi-landmark Cox fitting by construction,
- Remains computationally tractable and scalable to high-dimensional settings through standard machine learning tools (Sandqvist, 24 Jan 2026).
A plausible implication is increased applicability of dynamic event prediction in biomedical and other longitudinal studies, particularly where covariate processes are high-dimensional, non-Markovian, or where temporal interpretability is essential.