Cox-MT: Unified Survival & Event Modeling
- The Cox-MT model is a family of frameworks that generalizes traditional Cox processes by integrating survival analysis, multi-task event modeling, and dependent point process inference.
- It combines deep neural Cox regression with a Mean Teacher architecture, using both supervised and consistency-based losses to robustly learn from censored and unlabeled data.
- The model extends to log-Gaussian Cox processes and credit risk default modeling, enabling scalable variational inference and closed-form computations for diverse applications.
The Cox-MT model refers to a family of advanced Cox process and proportional hazards frameworks that generalize survival analysis, multi-task event modeling, and dependent point process inference. The term "Cox-MT" appears in multiple recent works, denoting (1) deep semi-supervised Cox proportional hazards models utilizing a Mean Teacher architecture for survival prediction (Sun et al., 28 Jan 2026), (2) multi-task log-Gaussian Cox process constructions sharing latent functions and encoding inter-task correlations (Aglietti et al., 2018), and (3) generalized multivariate Cox processes enabling complex default dependence modeling in credit risk (Gueye et al., 7 Aug 2025). The following sections delineate key theoretical formulations, computational techniques, and applied results linked with the Cox-MT paradigm.
1. Mean Teacher Deep Cox Model in Survival Prediction
The Cox-MT implementation in (Sun et al., 28 Jan 2026) merges neural Cox regression with semi-supervised learning via the Mean Teacher protocol. The model comprises:
- Architecture: Two feedforward neural nets (student and teacher), parameterized respectively by θ and θ′; the teacher is updated by exponential moving average, θ′_t ← α·θ′_{t−1} + (1−α)·θ_t. Single-modal variants ingest high-dimensional tabular (gene expression) or image features (DINOv2 whole-slide image embeddings). Multi-modal variants tokenize features and use mutual cross-attention between modalities before a final MLP.
- Loss function: The total loss combines a supervised Cox partial-likelihood term (over uncensored, time-to-event samples) with a consistency regularization across censored and unlabeled samples:
- Supervised: \(\mathcal{L}_{\mathrm{sup}} = -\sum_{i:\,\delta_i=1}\big[h_\theta(x_i) - \log\sum_{j\in R(t_i)}\exp h_\theta(x_j)\big]\), where \(h_\theta\) is the student risk score, \(\delta_i\) the event indicator, and \(R(t_i)\) the risk set at time \(t_i\).
- Unlabeled/Censored Regularization: \(\mathcal{L}_{\mathrm{cons}} = \frac{1}{|U|}\sum_{i\in U}\big(h_\theta(x_i+\xi) - h_{\theta'}(x_i+\xi')\big)^2\), with \(U\) the set of censored and unlabeled samples and \(\xi, \xi'\) independent input perturbations.
- Combined: \(\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \lambda\,\mathcal{L}_{\mathrm{cons}}\)
with \(\lambda\) typically constant, optionally ramped up in early epochs.
Handling of censored and unlabeled data: Censored and fully unlabeled data influence learning via the Mean Teacher consistency term—teacher scores serve as soft targets, not discrete pseudo-labels. Data perturbations (noise, dropout, augmentations) yield robustness to input variability.
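As a minimal sketch of this recipe (not the authors' implementation), the three ingredients — a Cox partial-likelihood loss over labeled samples, an exponential-moving-average teacher update, and a consistency penalty between student and teacher risk scores — can each be written in a few lines. The function names `cox_partial_likelihood`, `ema_update`, and `consistency_loss` are illustrative:

```python
import numpy as np

def cox_partial_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood for predicted log-risk scores.

    risk  : (n,) predicted log-risk scores h_theta(x_i)
    time  : (n,) observed times (event or censoring)
    event : (n,) 1 if the event was observed, 0 if censored
    """
    order = np.argsort(-time)                    # descending time
    risk, event = risk[order], event[order]
    log_cumsum = np.logaddexp.accumulate(risk)   # log-sum-exp over each risk set
    return -np.sum((risk - log_cumsum)[event == 1])

def ema_update(teacher, student, alpha=0.99):
    """Mean Teacher EMA: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def consistency_loss(student_scores, teacher_scores):
    """Mean squared disagreement between student and teacher risk scores."""
    return np.mean((student_scores - teacher_scores) ** 2)
```

In training, the student would minimize the combined supervised-plus-consistency objective on each batch, with the teacher refreshed by `ema_update` after every optimizer step.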
Empirical results: Cox-MT outperforms Cox-nnet across four TCGA cancer cohorts, with marked improvement as the number of unlabeled samples increases (BRCA c-index: 0.81→0.90, IBS: 0.087→0.061). Multi-modal Cox-MT leverages cross-attention to exceed single-modal performance.
This suggests Cox-MT's general recipe (student/teacher, partial-likelihood, soft regularization) may be applied to time-to-event modeling outside biology whenever labeled data is scarce and large auxiliary cohorts exist.
2. Multi-task Log-Gaussian Cox Process Model
The Cox-MT construction in (Aglietti et al., 2018) generalizes classical Cox process modeling to correlated multi-task event point-processes by:
- Model formulation:
\(\lambda_d(x) = \exp\big(\sum_{q=1}^{Q} w_{d,q}(x)\, f_q(x)\big)\)
where \(f_1, \dots, f_Q \sim \mathcal{GP}\) (shared latent functions), with the mixing coefficients treated as GP draws themselves: \(w_{d,q} \sim \mathcal{GP}\).
- Moment computations: First and second moments of \(\lambda_d(x)\) are derived in closed form via log-normal identities, allowing calculation of expected intensities and cross-task covariance; e.g., \(\mathbb{E}[\lambda_d(x)] = \exp\big(\mu_d(x) + \tfrac{1}{2}\sigma_d^2(x)\big)\) when \(\log \lambda_d(x) \sim \mathcal{N}(\mu_d(x), \sigma_d^2(x))\).
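A brief illustration of the log-normal identities involved — a sketch assuming the log-intensities are (jointly) Gaussian under the posterior; the function names are hypothetical, not from the paper:

```python
import numpy as np

def lgcp_intensity_moments(mu, var):
    """Moments of lambda(x) = exp(g(x)) for g ~ N(mu, var):
    E[lambda] = exp(mu + var/2), E[lambda^2] = exp(2*mu + 2*var)."""
    mean = np.exp(mu + 0.5 * var)
    second = np.exp(2.0 * mu + 2.0 * var)
    return mean, second - mean ** 2          # (mean, variance)

def cross_task_cov(mu_d, mu_e, var_d, var_e, cov_de):
    """Cov(lambda_d, lambda_e) when (g_d, g_e) are jointly Gaussian with
    cross-covariance cov_de: m_d * m_e * (exp(cov_de) - 1)."""
    m_d = np.exp(mu_d + 0.5 * var_d)
    m_e = np.exp(mu_e + 0.5 * var_e)
    return m_d * m_e * (np.exp(cov_de) - 1.0)
```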
Variational inference: Introduces inducing points for the latent functions and mixing weights; mean-field Gaussian variational posteriors are parameterized for scalable inference. The evidence lower bound (ELBO) enables gradient-based optimization of model parameters.
Computational efficiency: Inducing-point methods enable order-of-magnitude speedup over MCMC samplers for multivariate LGCPs, scaling to large numbers of tasks and events.
This suggests Cox-MT is suitable for joint modeling of spatial-temporal phenomena across related event types, with direct extension to coregionalization and Bayesian hierarchical inference.
3. Multivariate Generalized Cox Processes for Dependent Defaults
The Cox-MT framework in (Gueye et al., 7 Aug 2025) addresses dependent default timing in credit risk via a multivariate construction that encompasses both common and idiosyncratic shocks:
- Setup: Default times
\(\tau_i = \inf\{t \ge 0 : \Lambda_i(t) \ge E_i\}\)
with \(\Lambda_i\) an adapted, increasing càdlàg process (typically Lévy/compound Poisson/subordinator or shot-noise) and \(E_i \sim \mathrm{Exp}(1)\) independent of \(\Lambda_i\).
- Azéma supermartingale and compensator representation:
\(G_i(t) = \mathbb{P}(\tau_i > t \mid \mathcal{F}_t) = e^{-\Lambda_i(t)}\)
Under deterministic compensator assumptions, \(G_i(t) = e^{-\Lambda_i(t)}\) is nonrandom, yielding \(\mathbb{P}(\tau_i > t) = \mathbb{E}\big[e^{-\Lambda_i(t)}\big]\) in the general stochastic case.
- Construction of intensities: Each compensator is a sum of continuous and jump-driven parts,
\(\Lambda_i(t) = \int_0^t \lambda_i(s)\, ds + Z_i(t)\)
where the jump component \(Z_i\) can encode idiosyncratic or systemic jumps.
- Joint survival probabilities:
\(\mathbb{P}(\tau_1 > t_1, \dots, \tau_n > t_n) = \mathbb{E}\big[\exp\big(-\textstyle\sum_{i=1}^{n} \Lambda_i(t_i)\big)\big]\)
with Möbius-inversion weights derived from the underlying jump parameters.
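A hedged Monte Carlo sketch of this construction, assuming a single shared compound-Poisson factor Z(t) with exponential jumps and constant continuous rates; all names, loadings `beta`, and parameter choices are illustrative, not the paper's calibration:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_common_shock(t, rate, jump_mean, n_paths):
    """Compound Poisson subordinator Z(t): Poisson(rate*t) many Exp(jump_mean) jumps."""
    n_jumps = rng.poisson(rate * t, size=n_paths)
    return np.array([rng.exponential(jump_mean, k).sum() for k in n_jumps])

def joint_survival_mc(t, lam, beta, rate, jump_mean, n_paths=50_000):
    """P(tau_1 > t, ..., tau_n > t) = E[exp(-sum_i Lambda_i(t))], where
    Lambda_i(t) = lam_i * t + beta_i * Z(t) shares one jump-driven factor Z.
    Conditional on Z the defaults are independent, so averaging the
    conditional survival over simulated Z paths gives the joint probability."""
    z = simulate_common_shock(t, rate, jump_mean, n_paths)
    total = np.sum(lam) * t + np.sum(beta) * z
    return np.exp(-total).mean()
```

With all loadings `beta` set to zero the estimator collapses to the independent-Cox survival \(\exp(-\sum_i \lambda_i t)\), recovering the nested submodel noted below.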
Special cases: The construction recovers independent Cox processes, common-factor models, and pure compound Poisson cases as nested submodels.
Extension: Allows superposition of continuous Cox intensities and jump-driven default processes: the survival function factorizes over continuous and jump components.
Calibration and implementation: Analytical tractability (closed-form survival probabilities, Laplace transforms) facilitates calibration to market data and efficient numeric simulation (Monte Carlo of jump times, Fourier-Laplace inversion for survival probabilities).
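For the compound-Poisson case, the Laplace transform is indeed available in closed form, which is what makes calibration cheap. The following single-name sketch (illustrative parameterization, not the paper's: one constant continuous rate `lam`, one jump factor with loading `beta` and exponential jump sizes) evaluates P(τ > t) analytically:

```python
import numpy as np

def survival_closed_form(t, lam, beta, rate, jump_mean):
    """P(tau > t) = exp(-lam*t) * E[exp(-beta * Z(t))] for a compound Poisson
    subordinator Z with intensity `rate` and Exp(jump_mean) jump sizes.
    The jump-size Laplace transform is E[exp(-beta*J)] = 1 / (1 + beta*jump_mean),
    giving E[exp(-beta*Z(t))] = exp(-rate * t * (1 - 1/(1 + beta*jump_mean)))."""
    laplace_jump = 1.0 / (1.0 + beta * jump_mean)
    return np.exp(-lam * t) * np.exp(-rate * t * (1.0 - laplace_jump))
```

A closed form like this can serve as the target in market calibration and as an exact check on Monte Carlo simulation of the same model.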
This suggests Cox-MT enables unified modeling of abrupt (jump-driven) and gradual (continuous) sources of systemic and individual default risk, bridging structural and reduced-form credit risk models.
4. Calibration, Computational Implementation, and Efficiency
Across instantiations, Cox-MT models leverage analytical closed forms and scalable variational or Monte Carlo schemes:
Calibration: Parameters (e.g., continuous rates, Lévy exponents, cross-attention fusion layers, GP kernel hyperparameters) are fitted via maximum-likelihood, moment-matching, or gradient-based optimization. Marginal survival curves can be matched precisely, and joint dependencies tuned via latent function or noise kernel selection (Gueye et al., 7 Aug 2025, Aglietti et al., 2018).
Implementation:
- Deep Cox-MT: Adam optimizer, cross-validation over learning rates, robust to input noise/dropout (Sun et al., 28 Jan 2026).
- Multi-task Cox processes: Inducing-point selection by k-means, jitter for numerical stability, batch optimization of ELBO (Aglietti et al., 2018).
- Default modeling: Simulation of jump times and sizes, fast convolution for shot-noise components, Laplace transform inversion for multi-period survival (Gueye et al., 7 Aug 2025).
- Computational scaling: Variational inference with M inducing points (M ≪ N events) drops computational complexity from \(\mathcal{O}(N^3)\) to \(\mathcal{O}(N M^2)\) per ELBO evaluation (Aglietti et al., 2018).
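The saving comes from only ever factorizing the M × M inducing-point Gram matrix rather than the full N × N one. A minimal Nyström-style illustration of that structure (a sketch, not the paper's code; `rbf` and `nystrom_approx` are hypothetical names):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def nystrom_approx(x, z, jitter=1e-8):
    """Low-rank approximation K_nn ~ K_nm K_mm^{-1} K_mn. The only matrix
    factorized is the M x M inducing-point Gram K_mm (jittered for stability),
    so the cost is O(N M^2 + M^3) rather than the O(N^3) of handling K_nn."""
    k_mm = rbf(z, z) + jitter * np.eye(len(z))
    k_nm = rbf(x, z)
    return k_nm @ np.linalg.solve(k_mm, k_nm.T)
```

When the inducing inputs coincide with the data inputs the approximation is exact (up to jitter); in practice z is chosen much smaller than x, e.g. by k-means as noted above.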
5. Applications and Empirical Results
The Cox-MT framework has demonstrated empirical strength in diverse contexts:
- Survival analysis: Cancer prognosis prediction, with semi-supervised gains (c-index improvement up to +0.09 to +0.18, IBS reductions 0.038–0.082) and superior multi-modal fusion (Sun et al., 28 Jan 2026).
- Spatial-temporal event modeling: Experiments on spatial crime datasets reveal Cox-MT achieves 10–100× speedup (and comparable or higher held-out log-likelihood) versus full-factorized LGCP or coregionalization models (Aglietti et al., 2018).
- Credit risk: Closed-form default probability and tranche price computation under the Cox-MT model with joint Lévy and shot-noise factors (Gueye et al., 7 Aug 2025).
6. Generalization and Transferability
The Cox-MT paradigm is structurally transferable:
- The Mean Teacher approach (deep Cox-MT) may be ported to any time-to-event domain, including engineering, medicine, and finance, subject to availability of large unlabeled or censored cohorts (Sun et al., 28 Jan 2026).
- Multi-task Cox process constructions extend naturally to ecological, epidemiological, and network event prediction, accommodating arbitrary inter-process dependence via GP priors (Aglietti et al., 2018).
- The generalized multivariate Cox process formalism admits unification of classical and jump-driven default/event models, and recovery of decorrelated or structured dependencies as special cases (Gueye et al., 7 Aug 2025).
A plausible implication is that Cox-MT models both unify the theoretical foundations of event-time modeling and provide a computationally tractable framework for learning in high-dimensional, correlated, and semi-supervised regimes.