
Landmarking Supermodels via Gradient Boosted Trees

Updated 28 January 2026
  • The paper’s main contribution is the integration of landmark supermodels with gradient boosted trees for dynamic survival prediction through a unified nonparametric framework.
  • This approach relaxes parametric and Markovian assumptions by pooling data from multiple landmark points to estimate a conditional hazard surface via Poisson regression.
  • Empirical evaluations show that the boosted-tree supermodel outperforms traditional Cox models in nonlinear and high-dimensional settings, ensuring temporal coherence and scalability.

Landmarking supermodels combined with gradient boosted trees provide a nonparametric methodology for dynamic prediction in event history analysis with high-dimensional, time-dependent covariates. The approach directly estimates the future conditional hazard surface with a single unified model, relaxing parametric and Markovian assumptions. The estimator uses gradient boosting on regression trees to capture nonlinearities and high-order interactions; the statistical procedure is formalized as a sieve M-estimation problem and reduces computationally to Poisson regression. The framework is distinguished by its theoretical consistency, its flexibility in high-dimensional covariate structures, and its avoidance of the temporal incoherence common in traditional landmark Cox model approaches (Sandqvist, 24 Jan 2026).

1. Landmarking Supermodels: Definition and Structure

Given $n$ independent subjects observed over the interval $[0,T]$, each subject $i$ is characterized by:

  • $N_i(t)$: counting process for the event of interest,
  • $Y_i(t)$: at-risk indicator,
  • $W_i(t)\in\mathbb{R}^p$: a high-dimensional, time-dependent covariate process.

For a fixed landmark time $s$, the future conditional hazard at time $t\geq s$, given covariate history up to $s$, is

$$\lambda(t, s, W_i(s)) = \lim_{\Delta t\downarrow 0}\frac{1}{\Delta t}\,P\{N_i(t+\Delta t)-N_i(t)=1 \mid \mathcal{G}_{s, t-}\},$$

where $\mathcal{G}_{s, t-} = \sigma\{N_i(u), Y_i(u), W_i(s): u < t\}$.

Traditional landmark analyses fit separate models at fixed landmark times $\{s_1,\dots,s_Q\}$. The landmarking supermodel instead fits a single surface $F(t, s, w)$ approximating $\log\lambda(t, s, w)$, pooling all landmarked data jointly for improved statistical efficiency and coherence.

2. Mathematical Formulation and Sieve M-Estimation

The empirical criterion for a candidate log-hazard surface $F(t, s, w)$, evaluated at a landmark $s$ for subject $i$, is
$$L_i(s; F) = \int_s^T F(t, s, W_i(s))\, dN_i(t) - \int_s^T Y_i(t) \exp(F(t, s, W_i(s)))\, dt.$$
Random landmarks $S_{i1},\dots,S_{iQ}$ are introduced, independent and uniformly distributed on $[0,T]$. The pooled empirical criterion is

$$M_n(F) = \frac{1}{n}\sum_{i=1}^n \frac{1}{Q}\sum_{q=1}^Q L_i(S_{iq}; F).$$

The estimation problem is intractable over the class of all bounded measurable functions, so a sequence of sieves $\mathcal{F}_k$ is defined; in practice, these are linear spans of regression trees. The estimator is given by

$$\hat F_n \approx \arg\max_{F\in\mathcal{F}_n} M_n(F).$$

3. Reduction to Poisson Regression and Integration of Gradient Boosted Trees

On each sieve $\mathcal{F}_k$, $F$ is piecewise constant on a partition $\{B_{k\ell}\}$ of the $(t,s,w)$ space, i.e., $F \equiv c_{k\ell}$ on cell $B_{k\ell}$. For landmark $S_{iq}$, define

$$O_{ik\ell}(S_{iq}, T] = \int_{S_{iq}}^T 1\{(t, S_{iq}, W_i(S_{iq}))\in B_{k\ell}\}\, dN_i(t),$$

$$E_{ik\ell}(S_{iq}, T] = \int_{S_{iq}}^T 1\{(t, S_{iq}, W_i(S_{iq}))\in B_{k\ell}\}\, Y_i(t)\, dt.$$

Then,

$$M_n(F) = \frac{1}{nQ}\sum_{i,q,\ell} \left[ c_{k\ell}\, O_{ik\ell} - e^{c_{k\ell}} E_{ik\ell} \right],$$

which is the Poisson log-likelihood of counts $O_{ik\ell}$ with exposures $E_{ik\ell}$. Consequently, the optimization can be performed using standard Poisson-loss gradient boosting, fitting trees to pseudo-residuals at each boosting step. Model training thus accommodates high-dimensional and nonlinear feature effects naturally.
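
The reduction has a useful closed-form consequence: on each cell, the maximizing constant is the log of total counts over total exposure. The following toy sketch (illustrative only; the simulated data and variable names are invented, not from the paper) generates counts and exposures on a single partition cell and checks numerically that $\hat c = \log(\sum_i O_i / \sum_i E_i)$ maximizes the criterion:

```python
import numpy as np

# Toy illustration (not the paper's code): on a single partition cell,
# the criterion sum_i [c*O_i - exp(c)*E_i] is a Poisson log-likelihood
# in the cell constant c, maximized in closed form at log(sum O / sum E).
rng = np.random.default_rng(0)
E = rng.uniform(0.5, 2.0, size=200)        # at-risk exposure per record
true_c = -0.7                              # true log-hazard on the cell
O = rng.poisson(np.exp(true_c) * E)        # simulated event counts

def criterion(c):
    return np.sum(c * O - np.exp(c) * E)

c_hat = np.log(O.sum() / E.sum())          # closed-form maximizer
assert criterion(c_hat) >= criterion(c_hat + 1e-3)
assert criterion(c_hat) >= criterion(c_hat - 1e-3)
```

With enough pooled exposure, `c_hat` also concentrates near the true log-hazard on the cell, which is the behavior the sieve argument exploits as partitions refine.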

4. Implementation, Computational Details, and Tuning

The approach is implemented as follows:

  • Data construction: For each subject and simulated landmark, compute $O$ and $E$ for all partition cells, yielding a "long" dataset with offsets $\log E$.
  • Boosting algorithm:
  1. Initialize $F^{(0)}(t,s,w) = \log\Lambda_L$.
  2. For $m=1,\dots,m_n$: compute negative gradients, fit regression trees (of depth $d_m$) to these pseudo-residuals, update $F^{(m)}$, and truncate to $[\log\Lambda_L, \log\Lambda_U]$.
  3. Select the learning rate ($\nu$), number of trees ($m$), and tree depth ($d$) by cross-validation on the Poisson loss.
  • Scalability: The dataset has on the order of $n \times Q$ rows; each boosting iteration costs $O(nQ\log n)$ with histogram-based splitting or $O(nQd)$ with exact splits. Merging cells that share coefficients reduces memory. Tree-based methods maintain efficiency as feature dimensionality grows, due to their internal variable selection mechanisms.
  • Software compatibility: The methodology admits direct implementation in off-the-shelf gradient boosting libraries (e.g., XGBoost, LightGBM) with a Poisson objective and exposure (offset).
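
The boosting loop described above can be sketched in a few dozen lines. The code below is a minimal illustration on invented toy data (the constants, truncation bounds, and hyperparameter values are assumptions, not the paper's settings), using scikit-learn trees as base learners for a Poisson-loss boost with truncation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy "long" dataset: one row per landmarked record, with features
# x playing the role of (t, s, w), event count O, and exposure E.
rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(0.0, 1.0, size=(n, 3))
E = rng.uniform(0.2, 1.0, size=n)               # at-risk exposures
true_loglam = -1.0 + 1.5 * X[:, 2]              # hazard driven by one covariate
O = rng.poisson(E * np.exp(true_loglam))        # simulated event counts

def poisson_loss(F):
    # Negative Poisson log-likelihood (up to constants), averaged over rows.
    return np.mean(E * np.exp(F) - O * F)

# Constant initialization at the marginal rate (the paper initializes at
# log Lambda_L; a marginal-rate start is used here for a quicker demo).
F = np.full(n, np.log(O.sum() / E.sum()))
log_lo, log_hi = -10.0, 2.0                     # assumed truncation bounds
nu, n_trees, depth = 0.1, 50, 2                 # assumed tuning values
losses = [poisson_loss(F)]
for _ in range(n_trees):
    grad = O - E * np.exp(F)                    # negative gradient (pseudo-residuals)
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, grad)
    F = np.clip(F + nu * tree.predict(X), log_lo, log_hi)
    losses.append(poisson_loss(F))
```

Each iteration fits a shallow tree to the pseudo-residuals $O - E e^F$ and takes a damped, truncated step, so the Poisson loss decreases across iterations for a small enough learning rate.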

5. Theoretical Guarantees and Temporal Coherence

Consistency of the estimator relies on two central properties:

  • Boundedness: $\exp[F]$ lies in $[\Lambda_L, \Lambda_U]$.
  • Approximate maximization: the fitted $\hat F_n$ maximizes $M_n$ over $\mathcal{F}_n$ up to an $o_P(1)$ error.

Under these conditions, and as the sieve partitions become increasingly fine, the estimator asymptotically recovers the true log-hazard surface in the $L^1$-norm:
$$\|\hat F_n - \log\lambda\|_{L^1(\mu)} = o_P(1).$$
Temporal inconsistencies prevalent in landmark Cox models, where sequences of landmark-specific regression coefficients $\{\beta(s)\}$ may not jointly define a proper conditional hazard without further assumptions, are avoided. The supermodel approach ensures internal coherence because the estimator is a single function of $(t,s,w)$, and tree-based sieves are dense in the relevant function space. No additional Markov, homogeneity, or consistency conditions (e.g., Jewell–Nielsen) are necessary (Sandqvist, 24 Jan 2026).

6. Empirical Evaluation and Applied Results

Simulation studies examine three principal settings:

  • Scenario 1 (linear, Markovian; 3 covariates): For small $n$, the naive Cox method was optimal; at large $n$ ($n \to 10^5$), the Cox and boosted-tree supermodels performed equivalently.
  • Scenario 2 (nonlinear, non-Markovian): The boosted-tree supermodel yielded superior performance for moderate or large $n$ ($n \ge 10^3$); the Cox supermodel was biased.
  • Scenario 3 (high-dimensional noise, $p=50$): The boosted-tree supermodel resisted high variance, screening out the 47 noise covariates, while the Cox supermodel displayed instability.

Across all settings, increasing the number of landmarks $Q$ yielded progressive improvements in the RMSE of survival predictions.

The primary biliary cirrhosis (PBC) case study used data from 312 patients with irregularly timed visits and uniformly distributed landmarks ($Q=10$). Key predictors, bilirubin and albumin, were identified via Shapley values and partial-dependence plots. The generated individual dynamic survival curves matched established clinical prognostic models. Computation took approximately one minute on a standard laptop using R and XGBoost.

7. Summary and Significance

The landmarking supermodel with gradient boosted trees extends classical landmark analysis to a fully nonparametric regime, enabling dynamic, high-dimensional survival prediction without restrictive model assumptions. The method:

  • Simultaneously leverages data from multiple landmark points via a single (t,s,w)-indexed supermodel,
  • Frames estimation as a sieve M-estimator of a Poisson-regression likelihood,
  • Employs gradient boosted trees for expressive, interpretable function approximation,
  • Achieves weak $L^1$-consistency of the fitted log-hazard,
  • Resolves temporal incoherence that affects multi-landmark Cox fitting by construction,
  • Remains computationally tractable and scalable to high-dimensional settings through standard machine learning tools (Sandqvist, 24 Jan 2026).

A plausible implication is increased applicability of dynamic event prediction in biomedical and other longitudinal studies, particularly where covariate processes are high-dimensional, non-Markovian, or where temporal interpretability is essential.

