High-Dimensional Misspecified Linear Models
- High-dimensional misspecified linear models are defined as linear approximations in settings where p exceeds n, targeting the pseudo-true projection for robust estimation.
- They employ methods like de-sparsified Lasso and double-estimation-friendly (DEF) inference to achieve valid confidence intervals and hypothesis tests despite misspecification.
- Advanced model selection techniques using generalized AIC/BIC with sandwich corrections enhance sparse model recovery and predictive performance under nonlinearity and heteroskedasticity.
High-dimensional misspecified linear models refer to statistical frameworks in which the observed data are modeled with a linear structure in high-dimensional settings ($p \gg n$), but the true underlying data-generating mechanism is not necessarily linear. Despite the misspecification, much recent research has focused on valid inference, robust estimation, variable/model selection, and uncertainty quantification in such regimes. The following entry synthesizes key theoretical foundations, methodology, interpretational issues, and recent developments.
1. Definition and Formal Structure
Let $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, be the observed data, collected into a design matrix $X \in \mathbb{R}^{n \times p}$ and response vector $Y \in \mathbb{R}^n$. The working model is
$$Y = X\beta + \varepsilon.$$
In the well-specified case, there exists $\beta^0$ such that $Y = X\beta^0 + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X] = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. In the misspecified case, the true data-generating process is $y_i = f(x_i) + \varepsilon_i$, where $f$ is arbitrary and the $x_i$ are i.i.d. with covariance $\Sigma = \mathbb{E}[x_i x_i^\top]$ (random design). The parameter of inferential interest then becomes the pseudo-true projection
$$\beta^* = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \mathbb{E}\big[(f(x_i) - x_i^\top \beta)^2\big] = \Sigma^{-1}\,\mathbb{E}[x_i f(x_i)],$$
which is the best linear approximation of $f$ (Bühlmann et al., 2015). Under fixed design and full column-rank $X$, the (possibly sparse) representation is $\beta^* = (X^\top X)^{-1} X^\top f(X)$, with $f(X) = (f(x_1), \dots, f(x_n))^\top$.
In practice, high dimensionality is quantified via $p \gg n$, and the framework encompasses both random and fixed design regimes.
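To make the projection target concrete, the following sketch (a hypothetical example, not taken from the cited papers) simulates a nonlinear regression function and checks that ordinary least squares converges to the pseudo-true projection $\beta^* = \Sigma^{-1}\mathbb{E}[x f(x)]$ rather than to any "true" linear coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50_000, 3  # large n so the OLS fit is close to its population target

# Random design: standard Gaussian covariates; the regression function is
# NOT linear in x, so the linear working model is misspecified.
X = rng.standard_normal((n, p))
f = X[:, 0] + 0.5 * X[:, 0] ** 2
y = f + rng.standard_normal(n)

# Here Sigma = I and, for x1 ~ N(0,1), E[x1 * (x1 + 0.5 x1^2)] = 1 while the
# quadratic part is orthogonal to every coordinate, so beta* = (1, 0, 0).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)  # approximately [1, 0, 0]: OLS targets the projection, not f
```

The fitted coefficients stabilize near the projection parameter even though no linear model generated the data, which is exactly the sense in which inference under misspecification remains well posed.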
2. Inference Methodology and Robust Estimation
Classical inference procedures often yield invalid results under misspecification, especially in high dimensions and for standard debiased or regularized estimators. Recent advances address these limitations by developing procedures that target meaningful population quantities (often the projection parameter $\beta^*$) and employ robust techniques.
De-sparsified Lasso
The de-sparsified (or debiased) Lasso is a two-step procedure:
- Compute the initial Lasso estimator
$$\hat\beta = \operatorname*{arg\,min}_{\beta} \Big\{ \|Y - X\beta\|_2^2/(2n) + \lambda \|\beta\|_1 \Big\}.$$
- For each $j$, estimate the $j$-th row of an approximate inverse $\hat\Theta$ of the design covariance $\hat\Sigma = X^\top X/n$ via a nodewise Lasso regression of $X_j$ on $X_{-j}$ to obtain the residual direction $Z_j$.
- Form the estimator:
$$\hat b_j = \hat\beta_j + \frac{Z_j^\top (Y - X\hat\beta)}{Z_j^\top X_j}.$$
Under design and sparsity regularity conditions (see assumptions below), the normalized estimator is asymptotically normal:
$$\frac{\sqrt{n}\,(\hat b_j - \beta^*_j)}{\hat\sigma_j} \xrightarrow{d} \mathcal{N}(0, 1).$$
The variance is consistently estimated using “sandwich” formulas robust to heteroskedasticity and nonlinearity (Bühlmann et al., 2015). Confidence intervals based on this estimator maintain nominal level regardless of whether the model is well-specified.
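A minimal numerical sketch of the two-step procedure with a robust sandwich variance follows; it is illustrative only (the tuning-parameter choice and nodewise details are simplified relative to the cited work), and uses the same nonlinear data-generating process as above so that the pseudo-true coefficient of interest equals 1:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, j = 400, 50, 0
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.standard_normal(n)  # nonlinear truth

# Step 1: initial Lasso fit (simple theory-motivated lambda, illustrative)
lam = 2 * np.sqrt(np.log(p) / n)
beta_init = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Step 2: nodewise Lasso of X_j on X_{-j}; Z is the residual direction
X_minus = np.delete(X, j, axis=1)
gamma = Lasso(alpha=lam, fit_intercept=False).fit(X_minus, X[:, j]).coef_
Z = X[:, j] - X_minus @ gamma

# Step 3: de-sparsified estimate of the j-th projection coefficient
resid = y - X @ beta_init
b_j = beta_init[j] + Z @ resid / (Z @ X[:, j])

# Sandwich variance: robust to heteroskedasticity and nonlinearity
se_j = np.sqrt(np.sum((Z * resid) ** 2)) / abs(Z @ X[:, j])
print(f"b_j = {b_j:.3f}, 95% CI = ({b_j - 1.96*se_j:.3f}, {b_j + 1.96*se_j:.3f})")
```

The sandwich form $\sum_i (Z_{ij}\hat r_i)^2 / (Z_j^\top X_j)^2$ is what keeps the interval honest when the residuals carry the unmodeled nonlinearity.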
Double-Estimation-Friendly (DEF) Inference
The DEF property refers to test statistics whose null distribution is robust to misspecification as long as either the model for $Y$ given $X$ or for $X_j$ given $X_{-j}$ is correct. In high-dimensional settings, Shah and Bühlmann (Shah et al., 2019) propose DEF methods for hypothesis testing (e.g., the conditional-independence-type null $\beta^*_j = 0$) and construction of confidence intervals for regression parameters. Their high-dimensional DEF test uses regularized residuals from both regressions, of $Y$ on $X_{-j}$ and of $X_j$ on $X_{-j}$, constructed via square-root Lasso regressions, with a self-normalized regularized partial-correlation statistic of the form
$$T = \frac{\sum_{i=1}^n \hat\varepsilon_i \hat\xi_i}{\sqrt{\sum_{i=1}^n \hat\varepsilon_i^2 \hat\xi_i^2}},$$
where $\hat\varepsilon$ and $\hat\xi$ denote the two residual vectors.
Type I error is controlled under the null if either nuisance model is correctly specified and sparse. Coverage of the resulting confidence intervals is robust to nonlinearity and heteroskedasticity.
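The double-robustness can be seen in a small sketch (plain Lasso stands in for the square-root Lasso here, so this is a simplified illustration, not the authors' implementation): the outcome model is deliberately misspecified while the covariate model is correct, and the residual-product statistic still behaves like a standard normal under the null.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, j = 400, 50, 0
X = rng.standard_normal((n, p))
# Null holds: Y depends on X only through X_{-j}, and nonlinearly, so the
# Y-model is misspecified while the (trivial) X_j-model is correct and sparse.
y = X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.standard_normal(n)

lam = 2 * np.sqrt(np.log(p) / n)
X_minus = np.delete(X, j, axis=1)

# Regularized residuals from both nuisance regressions
r_y = y - X_minus @ Lasso(alpha=lam, fit_intercept=False).fit(X_minus, y).coef_
r_x = X[:, j] - X_minus @ Lasso(alpha=lam, fit_intercept=False).fit(
    X_minus, X[:, j]).coef_

# Self-normalized partial-correlation statistic; approximately N(0,1)
# under the null when either nuisance model is correct and sparse.
T = (r_y @ r_x) / np.sqrt(np.sum((r_y * r_x) ** 2))
print(f"T = {T:.3f}")  # |T| > 1.96 would reject at the 5% level
```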
Model Selection under Misspecification
Conventional information criteria such as AIC or BIC are not robust under high-dimensional misspecification. Two lines of recent work (Lv & Liu (Basu et al., 2014), Lv et al. (Demirkaya et al., 2018)) obtain high-dimensional analogs of these criteria with explicit Kullback–Leibler risk expansions:
- Generalized AIC (GAIC) and Generalized BIC (GBIC) replace the classical penalty by terms involving the trace and log-determinant of the "covariance contrast" matrix $H^{-1}J$ (with $H$ the expected negative Hessian of the working log-likelihood and $J$ the variance of its score), reflecting the sandwich variance inflation from misspecification.
- Large-scale penalization (GBIC/HGBIC) further includes a prior-based complexity term, growing with $\log p$, for the model size $d$.
Uniform consistency and model-selection consistency are established under mild high-dimensional assumptions, and explicit plug-in estimators of $H$ and $J$ are used in practical computation.
Table: Information Criteria for Model Selection in High-Dimensional Misspecified Models
| Criterion | Penalty Structure | Misspecification Correction |
|---|---|---|
| Classic AIC/BIC | $2d$, $d\log n$ | None |
| GAIC/GBIC | $2\operatorname{tr}(\hat H^{-1}\hat J)$; trace and $\log\det$ terms in $\hat H^{-1}\hat J$ | Yes (sandwich, KL) |
| GBIC/HGBIC | sandwich terms (GBIC) plus prior-based log-model-size penalty | Yes (plus KL/prior and sandwich) |
The sandwich correction is essential to avoid over-selection and poor predictive risk under nonlinearity or non-normality (Basu et al., 2014, Demirkaya et al., 2018).
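As a concrete (and deliberately simplified) illustration of the sandwich penalty, the sketch below scores nested OLS working models with an AIC-type criterion whose penalty is $2\operatorname{tr}(\hat H^{-1}\hat J)$; the exact GAIC/GBIC formulas in the cited papers differ in detail, but the mechanism is the same:

```python
import numpy as np

def gaic(X, y):
    # Illustrative GAIC-style criterion for a Gaussian OLS working model:
    # -2*loglik + 2*tr(H^{-1} J), with H the expected negative Hessian and
    # J the score variance. Under a correct homoskedastic linear model,
    # tr(H^{-1} J) is approximately d, recovering classical AIC.
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    H = X.T @ X / sigma2
    Xr = X * resid[:, None]
    J = Xr.T @ Xr / sigma2**2
    return -2 * loglik + 2 * np.trace(np.linalg.solve(H, J))

rng = np.random.default_rng(3)
n = 500
X = rng.standard_normal((n, 5))
y = X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.standard_normal(n)  # nonlinear truth

# Score nested candidate models; smaller is better
scores = {d: gaic(X[:, :d], y) for d in range(1, 6)}
print(scores)
```

Under misspecification the trace term exceeds $d$ wherever the residuals are heteroskedastic, which is what discourages over-selection relative to plain AIC.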
3. Sufficient Conditions for Asymptotic Validity
Inference and selection procedures in high-dimensional misspecified linear models require regularity beyond classical settings. Typical assumptions include:
- Design matrix: minimal and maximal eigenvalues of $\Sigma$ are bounded below and above; uniform boundedness of covariates and nodewise regression residuals.
- Sparsity: both the pseudo-true parameter $\beta^*$ and the nodewise regression coefficients must be sufficiently sparse, typically $s_0 = o(\sqrt{n}/\log p)$.
- Error moments: (conditional) sub-Gaussianity or finite higher-order moments for the errors.
- Compatibility/restricted eigenvalue: ensuring regularized estimators are well behaved.
- Nondegeneracy: residual variance of score corrections must be bounded away from zero.
These conditions ensure that rates of convergence match those under well-specified models and that robust procedures (debiased Lasso, DEF inference) deliver reliable uncertainty quantification (Bühlmann et al., 2015, Shah et al., 2019).
4. Interpretation and Scope of Inference
Under misspecification, population regression coefficients lack a straightforward causal or structural interpretation. The correct target is always the $L_2$-projection of the (potentially nonlinear) regression function $f$ onto the linear span of the covariates:
$$\beta^* = \operatorname*{arg\,min}_{\beta} \mathbb{E}\big[(f(x_i) - x_i^\top \beta)^2\big].$$
Hypothesis tests and confidence intervals thus pertain to these "pseudo-true" parameters. For Gaussian predictors, the non-zeros of $\beta^*$ often correspond to the active predictors in the support of $f$. In fixed-design scenarios, the confidence intervals obtained are uniform across all sufficiently sparse approximations of $f(X)$ by $X\beta$ (Bühlmann et al., 2015).
For model selection, sandwich-based generalized information criteria ensure that procedures select sparse models approximating the truth in Kullback–Leibler risk, not necessarily the "true" support.
5. Extensions and Practical Recommendations
- Measurement error: Extensions of DEF methodologies handle measurement error under double-robustness, ensuring that inference for a component of $\beta^*$ is valid as long as either the outcome or exposure model is correctly specified and sparse (Cui et al., 2024).
- Prediction inference with general/dense loadings: Wald-type inference for predictions $x_{\mathrm{new}}^\top \beta^*$ (with a dense loading $x_{\mathrm{new}}$) is possible in high-dimensional misspecified models, provided sparsity of the precision matrix or inverse Hessian rather than of $\beta^*$ itself (Liang et al., 15 Jul 2025).
- Semi-parametric and double-robust inference: For missing data or potential outcome models, double-robust -estimation and desparsified inference are valid for projection parameters as long as one of the nuisance models (propensity or outcome regression) is correctly specified and sparse (Chakrabortty et al., 2019, Dukes et al., 2019).
- Validation and software: Implementations such as the “hdi” R package provide both classical and robust sandwich variance estimation for de-sparsified Lasso (Bühlmann et al., 2015).
Recommended practice is to state explicitly whether $X$ is treated as random (unconditional inference) or fixed (conditional), check compatibility conditions via pilot fits, and employ sandwich variance estimation whenever model misspecification, heteroskedasticity, or nonlinearity is suspected.
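The payoff of the sandwich recommendation can be seen in a few lines (a minimal sketch with a hypothetical heteroskedastic example; HC0-style robust errors for plain OLS, not the full high-dimensional machinery):

```python
import numpy as np

def ols_sandwich_se(X, y):
    # Classical vs heteroskedasticity-robust (HC0 sandwich) standard errors
    n, d = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - d)
    se_classical = np.sqrt(np.diag(XtX_inv) * sigma2)
    meat = (X * resid[:, None]).T @ (X * resid[:, None])   # sum x_i x_i' r_i^2
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_robust

rng = np.random.default_rng(4)
n = 2000
X = rng.standard_normal((n, 2))
# Nonlinear truth: residuals around the linear projection are heteroskedastic
y = X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.standard_normal(n)
beta, se_c, se_r = ols_sandwich_se(X, y)
print(beta, se_c, se_r)  # robust SE for the first coefficient is inflated
```

Classical standard errors understate the uncertainty here precisely because the unmodeled quadratic term makes the residual variance depend on $x_1$; the sandwich form absorbs that.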
6. Related Developments and Limitations
- High-dimensional F-statistics: The classical $F$-test for submodel regression remains asymptotically valid even under gross misspecification of the submodel under precise high-dimensional conditions, effectively testing whether the best linear predictor is constant (Leeb et al., 2019).
- High-dimensional mixed models: REML estimation of variance components remains consistent for error variance and for total heritability in sparse/structured random-effects models, even under full-model misspecification; after suitable scaling, random-effect variance is consistently estimated (Jiang et al., 2014, Dao et al., 2021).
- Bayesian inference: Standard Bayesian posteriors can be inconsistent under high-dimensional misspecification, exhibiting non-concentration or hypercompression. The SafeBayes algorithm rectifies this by adaptively flattening the likelihood (learning rate $\eta$), restoring consistency for prediction and calibration (Grünwald et al., 2014).
7. Empirical Performance and Simulation Findings
Simulations across a range of misspecification scenarios consistently show:
- Sandwich-based, debiased, and DEF-type methods maintain nominal coverage for confidence intervals even under nonlinearity or heteroskedasticity, while naive or classical approaches can substantially undercover.
- Model-selection procedures with explicit sandwich and KL penalty (GBIC, HGBIC) avoid over-selection and spurious discoveries in ultra-high-dimensional settings, outperforming AIC/BIC when the linear model is only an approximation (Basu et al., 2014, Demirkaya et al., 2018).
- GCV-tuned ridge regression provides prediction risk close to the oracle across a broad range of misspecification, outperforming Lasso for Gaussian designs (Shinkyu, 20 Jan 2026).
- In GWAS and mixed model contexts, REML error variance estimates are consistent regardless of sparsity; genetic variance estimates may be “shrunk” by sparsity but are consistent after scaling (Jiang et al., 2014, Dao et al., 2021).
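The GCV-tuned ridge finding is easy to reproduce in miniature; the sketch below (illustrative, with an arbitrary misspecified data-generating process and a small hand-picked penalty grid) selects the ridge penalty by minimizing the standard GCV score $\|y - \hat y_\lambda\|^2 / n \big(1 - \operatorname{tr}(S_\lambda)/n\big)^2$:

```python
import numpy as np

def ridge_gcv(X, y, lambdas):
    # Select the ridge penalty by generalized cross-validation (GCV)
    n, p = X.shape
    best = None
    for lam in lambdas:
        # Smoother matrix S_lam = X (X'X + lam I)^{-1} X'
        A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
        S = X @ A
        fitted = S @ y
        gcv = np.sum((y - fitted) ** 2) / n / (1 - np.trace(S) / n) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam, A @ y)
    return best  # (gcv score, chosen lambda, ridge coefficients)

rng = np.random.default_rng(5)
n, p = 200, 100
X = rng.standard_normal((n, p))
# Dense linear signal plus a nonlinear perturbation: misspecified working model
y = X @ rng.normal(0, 0.3, p) + 0.3 * np.tanh(X[:, 0]) + rng.standard_normal(n)
score, lam, beta = ridge_gcv(X, y, lambdas=[0.1, 1.0, 10.0, 100.0])
print(f"GCV-selected lambda: {lam}")
```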
In summary, high-dimensional misspecified linear models constitute a well-defined regime where robust inference, selection, and prediction are achievable by targeting projection parameters and employing sandwich correction, double-robust or de-sparsified procedures. These methods accommodate both the high-dimensionality and the model uncertainty that inevitably arise in complex applications, with validated theoretical guarantees and practical implementations (Bühlmann et al., 2015, Shah et al., 2019, Basu et al., 2014, Liang et al., 15 Jul 2025, Chakrabortty et al., 2019, Cui et al., 2024).