Regression Model for Censored Data
- Regression models for censored data are specialized frameworks for responses that are only partially observed, most commonly under right-censoring in survival analysis.
- Dimension reduction via conditional independence and single-index models simplifies high-dimensional kernel smoothing, mitigating the curse of dimensionality.
- Weighted empirical risk minimization and two-stage trimmed least squares yield asymptotically normal inference even with complex, high-dimensional covariates.
A regression model for censored data refers to any statistical or machine learning framework in which the response variable is only partially observed due to censoring, most commonly through right-censoring in survival analysis, but also through left, interval, or random censoring mechanisms. The canonical paradigm involves a multivariate covariate $X \in \mathbb{R}^d$ and a univariate response $T$ which is observable only up to an associated censoring variable $C$, so that one actually observes $Y = \min(T, C)$ and an indicator $\delta = \mathbf{1}\{T \le C\}$. The principal challenge is to devise estimators for regression or conditional distribution functionals that correctly account for the censoring while controlling the curse of dimensionality stemming from high-dimensional covariates.
1. Problem Setup and the Curse of Dimensionality
In the censored-data regression context, for a sample of i.i.d. triplets $(Y_i, \delta_i, X_i)$, $i = 1, \dots, n$, the fundamental object is the estimation of a regression or distributional functional of the unobserved $T$. Naïve nonparametric estimators of, for instance,
$$m(x) = \mathbb{E}[T \mid X = x]$$
nominally require kernel or histogram smoothing in $d$ dimensions, leading to sample size requirements that grow exponentially in $d$. Direct approaches are thus practically infeasible for moderate-to-large $d$, motivating dimension reduction assumptions that specifically exploit structure in the dependence of $T$ (and/or $C$) on $X$.
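The dimensionality barrier can be made concrete with a small numerical sketch (the setup and numbers are illustrative, not from the source): the fraction of sample points available to a fixed-bandwidth local smoother collapses as $d$ grows.

```python
import numpy as np

# Illustrative sketch: with a fixed smoothing radius, the share of uniform
# sample points near a query point shrinks roughly like radius^d, starving
# local averages of data as the covariate dimension d grows.
def neighborhood_fraction(n, d, radius=0.25, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))   # covariates on [-1, 1]^d
    dist = np.linalg.norm(X, axis=1)          # distance to the query point 0
    return float(np.mean(dist <= radius))     # fraction usable for smoothing

fractions = {d: neighborhood_fraction(20_000, d) for d in (1, 3, 10)}
```

In one dimension roughly a quarter of the sample lies within the radius; in ten dimensions essentially no points do, which is why direct $d$-dimensional smoothing fails.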
2. Dimension Reduction via Conditional Independence and Single-Index Models
The central structural assumption (A0) posits that for some known function $g$ (often taken as $g(x) = \gamma^\top x$ for a low-dimensional parameter $\gamma$), the censoring variable $C$ and the response $T$ are conditionally independent given $g(X)$. That is,
$$T \perp\!\!\!\perp C \mid g(X).$$
Typically, $g$ is parameterized so that $g(X) = \gamma^\top X$, and the estimation of conditional functionals can be performed by nonparametric smoothing on the scalar index $g(X)$, thereby bypassing the high dimensionality of $X$ in the smoothing step.
In the regression problem, a further dimension reduction is introduced via a mean regression single-index model for a possibly truncated or transformed response $\phi(T)$:
$$\mathbb{E}[\phi(T) \mid X] = f(\theta_0^\top X),$$
where $\theta_0$ is a finite-dimensional parameter and $f$ is an unknown smooth function. This structure means that all information about $X$ influencing the mean is projected onto the one-dimensional index $\theta_0^\top X$.
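As a concrete illustration of this data-generating structure, here is a hedged simulation sketch (all parameter values and the choice of $f$ are hypothetical): $T$ depends on $X$ only through $\theta_0^\top X$, and right-censoring by $C$ yields the observed pair $(Y, \delta)$.

```python
import numpy as np

# Hypothetical data-generating sketch for the single-index censored model:
# T = f(theta0'X) + noise with f(u) = exp(u / 2) (an assumed choice),
# censored on the right by an independent exponential C.
rng = np.random.default_rng(1)
n, d = 1000, 5
theta0 = np.array([1.0, 0.5, -0.5, 0.0, 0.0])  # true index parameter
X = rng.normal(size=(n, d))
T = np.exp(0.5 * (X @ theta0)) + 0.1 * rng.normal(size=n)
C = rng.exponential(scale=3.0, size=n)         # censoring times
Y = np.minimum(T, C)                           # observed follow-up
delta = (T <= C).astype(int)                   # 1 = uncensored, 0 = censored
```

Only $(Y_i, \delta_i, X_i)$ would be handed to the estimators; $T$ and $C$ remain latent.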
3. Construction of Joint Distribution and Regression Estimators
3.1. Joint Distribution Estimation
Given the conditional independence structure via $g(X) = \gamma^\top X$, the conditional distribution of censoring at time $t$ is estimated via a generalized Beran (conditional product-limit) estimator:
$$\hat{G}(t \mid z) = 1 - \prod_{Y_i \le t,\ \delta_i = 0} \left(1 - \frac{w_i(z)}{\sum_{j:\, Y_j \ge Y_i} w_j(z)}\right),$$
where $w_i(z) = K_h(z - \gamma^\top X_i) \big/ \sum_{j=1}^n K_h(z - \gamma^\top X_j)$ with $K_h(u) = K(u/h)/h$ for a univariate kernel $K$ and bandwidth $h$.
The joint estimator of $F(t, x) = \mathbb{P}(T \le t, X \le x)$ then takes the form
$$\hat{F}(t, x) = \frac{1}{n} \sum_{i=1}^n \frac{\delta_i\, \mathbf{1}\{Y_i \le t,\ X_i \le x\}}{1 - \hat{G}(Y_i^- \mid \hat{\gamma}^\top X_i)},$$
where $\hat{\gamma}$ is a root-$n$-consistent estimator for the index parameter. This construction corrects for the effect of censoring via an inverse-probability-of-censoring weighting scheme that adapts to the conditional survival of the censoring variable.
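A minimal implementation sketch of this construction, under assumed notation (Epanechnikov kernel, simplistic tie handling): a Beran-type conditional product-limit estimator for the censoring distribution, and the inverse-probability-of-censoring-weighted joint estimator built from it.

```python
import numpy as np

# Sketch under assumed notation: Beran-type conditional product-limit
# estimator of the censoring cdf G(t | z), z = gamma'X, and the IPCW-style
# joint distribution estimator built from it.
def beran_G(t, z, Y, delta, Z, h):
    """Estimate G(t | z) = P(C <= t | gamma'X = z)."""
    K = np.maximum(0.0, 0.75 * (1.0 - ((z - Z) / h) ** 2))  # Epanechnikov
    w = K / K.sum() if K.sum() > 0 else np.full(len(K), 1.0 / len(K))
    order = np.argsort(Y)
    Ys, ds, ws = Y[order], delta[order], w[order]
    cum = np.concatenate(([0.0], np.cumsum(ws)))  # kernel mass strictly before i
    surv = 1.0                                    # running survival of C
    for i in range(len(Ys)):
        if Ys[i] > t:
            break
        at_risk = 1.0 - cum[i]
        if ds[i] == 0 and at_risk > 0:            # delta == 0: censoring event
            surv *= 1.0 - ws[i] / at_risk
    return 1.0 - surv

def joint_F(t, x, Y, delta, X, gamma, h):
    """IPCW estimate of F(t, x) = P(T <= t, X <= x) (componentwise in x)."""
    Z = X @ gamma
    G = np.array([beran_G(yi - 1e-12, zi, Y, delta, Z, h)   # left limit G(Y_i^-)
                  for yi, zi in zip(Y, Z)])
    w = np.where(delta == 1, 1.0 / np.clip(1.0 - G, 1e-10, None), 0.0)
    ind = (Y <= t) & np.all(X <= np.asarray(x), axis=1)
    return float(np.mean(w * ind))
```

With no censoring (all $\delta_i = 1$) the weights reduce to one and `joint_F` collapses to the empirical joint cdf, a useful sanity check.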
3.2. Mean Regression Single-Index Estimation
The regression parameter $\theta_0$ is estimated via a two-stage trimmed least-squares approach:
- Initial estimator: Minimize the censoring-weighted criterion
$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{\delta_i\, \tau_0(X_i)}{1 - \hat{G}(Y_i^- \mid \hat{\gamma}^\top X_i)} \left(\phi(Y_i) - \hat{f}_\theta(\theta^\top X_i)\right)^2$$
over $\theta$ in a compact set, with initial trimming function $\tau_0$ and nonparametric kernel estimator $\hat{f}_\theta$ for $f$.
- Final estimator: With the preliminary estimate $\tilde{\theta}$, update the trimming region and minimize the same criterion over $\theta$ in shrinking neighborhoods of $\tilde{\theta}$ to obtain the final estimator $\hat{\theta}$. The updated trimming is defined through the density of the index $\theta^\top X$, restricting the smoothing to regions where that density is bounded away from zero.
The nonparametric regression function is estimated by a censoring-weighted Nadaraya–Watson smoother,
$$\hat{f}_\theta(u) = \frac{\displaystyle\sum_{i=1}^n \frac{\delta_i}{1 - \hat{G}(Y_i^- \mid \hat{\gamma}^\top X_i)}\, \phi(Y_i)\, \tilde{K}\!\left(\frac{u - \theta^\top X_i}{\tilde{h}}\right)}{\displaystyle\sum_{i=1}^n \frac{\delta_i}{1 - \hat{G}(Y_i^- \mid \hat{\gamma}^\top X_i)}\, \tilde{K}\!\left(\frac{u - \theta^\top X_i}{\tilde{h}}\right)},$$
where $\tilde{K}$ is a univariate kernel and $\tilde{h}$ is a bandwidth.
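The smoothing and index-estimation steps can be sketched together as follows. This is a deliberately simplified illustration (Gaussian kernel, unit weights standing in for the censoring correction, grid search in place of the trimmed two-stage optimization), not the source's exact procedure:

```python
import numpy as np

# Simplified sketch (assumptions: Gaussian kernel, unit weights W in place of
# the censoring correction, grid search instead of the trimmed two-stage
# optimization). fitted[i] is the Nadaraya-Watson smooth of Y on theta'X.
def ls_criterion(theta, Y, X, W, h):
    Z = X @ theta
    K = np.exp(-0.5 * ((Z[:, None] - Z[None, :]) / h) ** 2)
    den = K @ W
    fitted = (K @ (W * Y)) / np.where(den > 0, den, 1.0)  # leave-one-in fit
    return float(np.sum(W * (Y - fitted) ** 2))

# Toy data from a single-index model with d = 2 and ||theta|| = 1, so the
# index is parameterized by an angle; the criterion dips near the true angle.
rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
theta_true = np.array([np.cos(0.6), np.sin(0.6)])
Y = np.sin(X @ theta_true) + 0.05 * rng.normal(size=n)
W = np.ones(n)                                 # stand-in censoring weights
angles = np.linspace(0.0, np.pi, 60)
crits = [ls_criterion(np.array([np.cos(a), np.sin(a)]), Y, X, W, 0.3)
         for a in angles]
best_angle = float(angles[int(np.argmin(crits))])
```

The normalization $\|\theta\| = 1$ is one standard identifiability convention for single-index models; with censoring present, `W` would carry the inverse-probability weights from the Beran step.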
4. Asymptotic Theory and Efficiency
Under appropriate regularity conditions (smoothness of $f$, positivity of the relevant densities, conditions on kernels and bandwidths), the estimators admit uniform consistency and i.i.d.-style asymptotic (influence function) representations for general functionals:
$$\hat{F}(t, x) - F(t, x) = \frac{1}{n} \sum_{i=1}^n \psi(Y_i, \delta_i, X_i; t, x) + o_P(n^{-1/2}).$$
For the regression parameter,
$$\hat{\theta} - \theta_0 = \frac{1}{n} \sum_{i=1}^n \chi(Y_i, \delta_i, X_i) + o_P(n^{-1/2}),$$
with $\chi$ defined in terms of the derivatives of the regression function, the trimming regions, and the conditional distribution of the censoring. The estimator is root-$n$ consistent and asymptotically normal:
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}(0, \Sigma),$$
with $\Sigma$ computable from the influence function representation.
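To see what an i.i.d. influence-function representation buys in practice, a hedged Monte Carlo sketch (the IPCW mean with a known censoring distribution is a deliberately simple stand-in for the functionals above, and all distributional choices are assumptions) checks that the $\sqrt{n}$-scaled estimation error has a stable spread:

```python
import numpy as np

# Hedged Monte Carlo sketch: for an estimator with an i.i.d. influence-
# function representation, sqrt(n) times its sampling error has a stable
# spread. Stand-in functional: IPCW mean of T ~ Exp(1) with known censoring
# cdf G(t) = 1 - exp(-t/3), so E[delta * Y / (1 - G(Y))] = E[T] = 1.
def ipcw_mean(Y, delta, G_at_Y):
    return float(np.mean(delta * Y / (1.0 - G_at_Y)))

rng = np.random.default_rng(4)
scaled_sd = []
for n in (500, 2000, 8000):
    errs = []
    for _ in range(200):
        T = rng.exponential(1.0, size=n)
        C = rng.exponential(3.0, size=n)
        Y, delta = np.minimum(T, C), (T <= C).astype(float)
        errs.append(ipcw_mean(Y, delta, 1.0 - np.exp(-Y / 3.0)) - 1.0)
    scaled_sd.append(np.sqrt(n) * float(np.std(errs)))  # roughly constant in n
```

In the full methodology $G$ must itself be estimated (Beran step), which perturbs but does not destroy this root-$n$ behavior under the stated regularity conditions.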
5. Methodological Innovations and Practical Implications
The methodology fundamentally leverages:
- Dimension reduction in the censoring model: By parametrizing $g(X) = \gamma^\top X$, kernel estimation for the censoring correction operates in one dimension, reducing variance and avoiding the curse of dimensionality.
- Single-index regression: The mean structure is reduced to $\mathbb{E}[\phi(T) \mid X] = f(\theta_0^\top X)$, allowing a fully nonparametric regression function over a single argument, further mitigating dimensionality issues.
- Weighted empirical risk minimization: Both the joint distribution and the regression parameter estimators use weights that correct for censoring, via conditioning on the low-dimensional summaries of covariate information.
These allow consistent, asymptotically normal inference about joint and regression functionals in high-dimensional censored data, provided the conditional independence and single-index assumptions hold.
6. Theoretical and Computational Considerations
The kernel smoothing steps require bandwidth selection, and the estimator $\hat{\gamma}$ of the censoring-index parameter $\gamma$ must be root-$n$ consistent. The two-stage regression procedure requires careful selection of the trimming region and control of approximation error at the boundaries of the covariate space. Martingale asymptotics and counting process theory underlie the uniform convergence and central limit behavior.
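Bandwidth selection for the univariate smoothing steps can be sketched with leave-one-out least-squares cross-validation; this is an assumed, standard choice rather than necessarily the source's selector:

```python
import numpy as np

# Assumed, standard selector (not necessarily the source's): leave-one-out
# least-squares cross-validation for the bandwidth of a univariate
# Nadaraya-Watson smoother, the kind of choice the kernel steps require.
def loo_cv_score(h, Z, Y):
    K = np.exp(-0.5 * ((Z[:, None] - Z[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)            # leave each point out of its own fit
    den = K.sum(axis=1)
    pred = (K @ Y) / np.where(den > 0, den, 1.0)
    return float(np.mean((Y - pred) ** 2))

rng = np.random.default_rng(3)
Z = rng.uniform(-2.0, 2.0, size=300)    # one-dimensional index values
Y = np.sin(Z) + 0.2 * rng.normal(size=300)
grid = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
best_h = min(grid, key=lambda h: loo_cv_score(h, Z, Y))
```

Because all smoothing here is over scalar indices, a one-dimensional grid search of this kind is cheap even for large samples.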
The approach is implementable with moderate computational resources even when $d$ is large, because once the indices $\gamma^\top X$ and $\theta^\top X$ are formed, all kernel smoothing and density estimation is univariate. The main computational burden is in the iterative nonparametric estimation of the conditional censoring distribution and in the optimization over the regression parameter $\theta$.
7. Impact on High-Dimensional Survival and Censored Regression
This framework addresses two major limitations in previous censored regression methodology:
- It circumvents the high-variance, low-precision regime induced by kernel smoothing in high dimensions by explicit and testable dimension reduction in both censoring and regression index structure.
- It provides asymptotics and practical implementation steps for kernel-based censored regression that are robust to high-dimensional, potentially complex, covariate distributions and censoring that depends on observable covariates.
The approach has direct implications for large-scale biomedical survival analysis, reliability studies with high-dimensional predictors, and semiparametric regression settings where the classic Cox proportional hazards model's assumptions are not tenable.
Key Formula Recap:
| Quantity | Formula |
|---|---|
| Joint distribution estimator | $\hat{F}(t,x) = n^{-1}\sum_i \delta_i \mathbf{1}\{Y_i \le t, X_i \le x\} \big/ \big(1 - \hat{G}(Y_i^- \mid \hat{\gamma}^\top X_i)\big)$ |
| Conditional censoring cdf | $\hat{G}(t \mid z) = 1 - \prod_{Y_i \le t,\, \delta_i = 0} \big(1 - w_i(z) / \sum_{j: Y_j \ge Y_i} w_j(z)\big)$ |
| Nonparametric regression estimator | Censoring-weighted Nadaraya–Watson smoother $\hat{f}_\theta(u)$ on the index $\theta^\top X$ |
| Regression parameter estimator | $\hat{\theta} = \arg\min_\theta M_n(\theta)$, a trimmed, censoring-weighted least-squares criterion |
| Asymptotic linearization | $\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}(0, \Sigma)$ |
This approach constitutes a significant advance in the methodology for regression with censored data under high-dimensional covariates, enabling practical and theoretically valid inference when classical nonparametric and semiparametric approaches are no longer feasible due to dimension and censoring dependencies (Lopez et al., 2011).