High-Dimensional Misspecified Linear Models
- High-dimensional misspecified linear models are defined as linear approximations in settings where p exceeds n, targeting the pseudo-true projection for robust estimation.
- They employ methods like de-sparsified Lasso and double-estimation-friendly (DEF) inference to achieve valid confidence intervals and hypothesis tests despite misspecification.
- Advanced model selection techniques using generalized AIC/BIC with sandwich corrections enhance sparse model recovery and predictive performance under nonlinearity and heteroskedasticity.
High-dimensional misspecified linear models refer to statistical frameworks in which the observed data are modeled with a linear structure in high-dimensional settings ($p \gg n$), but the true underlying data-generating mechanism is not necessarily linear. Despite the misspecification, much recent research has focused on valid inference, robust estimation, variable/model selection, and uncertainty quantification in such regimes. The following entry synthesizes key theoretical foundations, methodology, interpretational issues, and recent developments.
1. Definition and Formal Structure
Let $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, be the observed data, collected into a design matrix $X \in \mathbb{R}^{n \times p}$ and response vector $Y \in \mathbb{R}^n$. The working model is
$$Y = X\beta + \varepsilon.$$
In the well-specified case, there exists $\beta^0$ such that $Y = X\beta^0 + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X] = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. In the misspecified case, the true data-generating process is $y_i = f(x_i) + \varepsilon_i$, where $f$ is arbitrary and the $x_i$ are i.i.d. with covariance $\Sigma = \mathbb{E}[x_i x_i^\top]$ (random design). The parameter of inferential interest then becomes the pseudo-true projection
$$\beta^* = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \mathbb{E}\big[(f(x_i) - x_i^\top \beta)^2\big] = \Sigma^{-1}\,\mathbb{E}[x_i f(x_i)],$$
which is the best linear approximation of $f$ (Bühlmann et al., 2015). Under fixed design and full column-rank $X$, the (possibly sparse) representation is $\beta^* = (X^\top X)^{-1} X^\top f(X)$, with $f(X) = (f(x_1), \dots, f(x_n))^\top$.
In practice, high dimensionality is quantified via $p \gg n$, and the framework encompasses both random and fixed design regimes.
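To make the projection target concrete, the following sketch (a hypothetical example, not taken from the cited papers) simulates a nonlinear regression function and checks that ordinary least squares converges to the pseudo-true projection $\beta^* = \Sigma^{-1}\mathbb{E}[x f(x)]$ rather than to any "true" linear coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50_000, 3  # large n so the OLS fit is close to its population target

# Random design: standard Gaussian covariates; the regression function is
# NOT linear in x, so the linear working model is misspecified.
X = rng.standard_normal((n, p))
f = X[:, 0] + 0.5 * X[:, 0] ** 2
y = f + rng.standard_normal(n)

# Here Sigma = I and, for x1 ~ N(0,1), E[x1 * (x1 + 0.5 x1^2)] = 1 while the
# quadratic part is orthogonal to every coordinate, so beta* = (1, 0, 0).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)  # approximately [1, 0, 0]: OLS targets the projection, not f
```

The fitted coefficients stabilize near the projection parameter even though no linear model generated the data, which is exactly the sense in which inference under misspecification remains well posed.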
2. Inference Methodology and Robust Estimation
Classical inference procedures often yield invalid results under misspecification, especially in high dimensions and for standard debiased or regularized estimators. Recent advances address these limitations by developing procedures that target meaningful population quantities (often the projection parameter $\beta^*$) and employ robust techniques.
De-sparsified Lasso
The de-sparsified (or debiased) Lasso is a two-step procedure:
- Compute the initial Lasso estimator
$$\hat\beta = \operatorname*{arg\,min}_{\beta} \Big\{ \|Y - X\beta\|_2^2/(2n) + \lambda \|\beta\|_1 \Big\}.$$
- For each $j$, estimate the $j$-th row of an approximate inverse $\hat\Theta$ of the design covariance $\hat\Sigma = X^\top X/n$ via a nodewise Lasso regression of $X_j$ on $X_{-j}$ to obtain the residual direction $Z_j$.
- Form the estimator:
$$\hat b_j = \hat\beta_j + \frac{Z_j^\top (Y - X\hat\beta)}{Z_j^\top X_j}.$$
Under design and sparsity regularity conditions (see assumptions below), the normalized estimator is asymptotically normal:
$$\frac{\sqrt{n}\,(\hat b_j - \beta^*_j)}{\hat\sigma_j} \xrightarrow{d} \mathcal{N}(0, 1).$$
The variance is consistently estimated using “sandwich” formulas robust to heteroskedasticity and nonlinearity (Bühlmann et al., 2015). Confidence intervals based on this estimator maintain nominal level regardless of whether the model is well-specified.
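A minimal numerical sketch of the two-step procedure with a robust sandwich variance follows; it is illustrative only (the tuning-parameter choice and nodewise details are simplified relative to the cited work), and uses the same nonlinear data-generating process as above so that the pseudo-true coefficient of interest equals 1:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, j = 400, 50, 0
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.standard_normal(n)  # nonlinear truth

# Step 1: initial Lasso fit (simple theory-motivated lambda, illustrative)
lam = 2 * np.sqrt(np.log(p) / n)
beta_init = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Step 2: nodewise Lasso of X_j on X_{-j}; Z is the residual direction
X_minus = np.delete(X, j, axis=1)
gamma = Lasso(alpha=lam, fit_intercept=False).fit(X_minus, X[:, j]).coef_
Z = X[:, j] - X_minus @ gamma

# Step 3: de-sparsified estimate of the j-th projection coefficient
resid = y - X @ beta_init
b_j = beta_init[j] + Z @ resid / (Z @ X[:, j])

# Sandwich variance: robust to heteroskedasticity and nonlinearity
se_j = np.sqrt(np.sum((Z * resid) ** 2)) / abs(Z @ X[:, j])
print(f"b_j = {b_j:.3f}, 95% CI = ({b_j - 1.96*se_j:.3f}, {b_j + 1.96*se_j:.3f})")
```

The sandwich form $\sum_i (Z_{ij}\hat r_i)^2 / (Z_j^\top X_j)^2$ is what keeps the interval honest when the residuals carry the unmodeled nonlinearity.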
Double-Estimation-Friendly (DEF) Inference
The DEF property refers to test statistics whose null distribution is robust to misspecification as long as either the model for $Y$ given $X$ or for $X_j$ given $X_{-j}$ is correct. In high-dimensional settings, Shah and Bühlmann (Shah et al., 2019) propose DEF methods for hypothesis testing (e.g., the conditional-independence-type null $\beta^*_j = 0$) and construction of confidence intervals for regression parameters. Their high-dimensional DEF test uses regularized residuals from both regressions, of $Y$ on $X_{-j}$ and of $X_j$ on $X_{-j}$, constructed via square-root Lasso regressions, with a self-normalized regularized partial-correlation statistic of the form
$$T = \frac{\sum_{i=1}^n \hat\varepsilon_i \hat\xi_i}{\sqrt{\sum_{i=1}^n \hat\varepsilon_i^2 \hat\xi_i^2}},$$
where $\hat\varepsilon$ and $\hat\xi$ denote the two residual vectors.
Type I error is controlled under the null if either nuisance model is correctly specified and sparse. Coverage of the resulting confidence intervals is robust to nonlinearity and heteroskedasticity.
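The double-robustness can be seen in a small sketch (plain Lasso stands in for the square-root Lasso here, so this is a simplified illustration, not the authors' implementation): the outcome model is deliberately misspecified while the covariate model is correct, and the residual-product statistic still behaves like a standard normal under the null.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, j = 400, 50, 0
X = rng.standard_normal((n, p))
# Null holds: Y depends on X only through X_{-j}, and nonlinearly, so the
# Y-model is misspecified while the (trivial) X_j-model is correct and sparse.
y = X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.standard_normal(n)

lam = 2 * np.sqrt(np.log(p) / n)
X_minus = np.delete(X, j, axis=1)

# Regularized residuals from both nuisance regressions
r_y = y - X_minus @ Lasso(alpha=lam, fit_intercept=False).fit(X_minus, y).coef_
r_x = X[:, j] - X_minus @ Lasso(alpha=lam, fit_intercept=False).fit(
    X_minus, X[:, j]).coef_

# Self-normalized partial-correlation statistic; approximately N(0,1)
# under the null when either nuisance model is correct and sparse.
T = (r_y @ r_x) / np.sqrt(np.sum((r_y * r_x) ** 2))
print(f"T = {T:.3f}")  # |T| > 1.96 would reject at the 5% level
```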
Model Selection under Misspecification
Conventional information criteria such as AIC or BIC are not robust under high-dimensional misspecification. Two lines of recent work (Lv & Liu (Basu et al., 2014), Lv et al. (Demirkaya et al., 2018)) obtain high-dimensional analogs of these criteria with explicit Kullback–Leibler risk expansions:
- Generalized AIC (GAIC) and Generalized BIC (GBIC) replace the classical penalty by terms involving the trace and log-determinant of the "covariance contrast" matrix $H^{-1}J$ (with $H$ the expected negative Hessian of the working log-likelihood and $J$ the variance of its score), reflecting the sandwich variance inflation from misspecification.
- Large-scale penalization (GBIC/HGBIC) further includes a prior-based complexity term, growing with $\log p$, for the model size $d$.
Uniform consistency and model-selection consistency are established under mild high-dimensional assumptions, and explicit plug-in estimators of $H$ and $J$ are used in practical computation.
Table: Information Criteria for Model Selection in High-Dimensional Misspecified Models
| Criterion | Penalty Structure | Misspecification Correction |
|---|---|---|
| Classic AIC/BIC | $2d$, $d\log n$ | None |
| GAIC/GBIC | $2\operatorname{tr}(\hat H^{-1}\hat J)$; trace and $\log\det$ terms in $\hat H^{-1}\hat J$ | Yes (sandwich, KL) |
| GBIC/HGBIC | sandwich terms (GBIC) plus prior-based log-model-size penalty | Yes (plus KL/prior and sandwich) |
The sandwich correction is essential to avoid over-selection and poor predictive risk under nonlinearity or non-normality (Basu et al., 2014, Demirkaya et al., 2018).
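As a concrete (and deliberately simplified) illustration of the sandwich penalty, the sketch below scores nested OLS working models with an AIC-type criterion whose penalty is $2\operatorname{tr}(\hat H^{-1}\hat J)$; the exact GAIC/GBIC formulas in the cited papers differ in detail, but the mechanism is the same:

```python
import numpy as np

def gaic(X, y):
    # Illustrative GAIC-style criterion for a Gaussian OLS working model:
    # -2*loglik + 2*tr(H^{-1} J), with H the expected negative Hessian and
    # J the score variance. Under a correct homoskedastic linear model,
    # tr(H^{-1} J) is approximately d, recovering classical AIC.
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    H = X.T @ X / sigma2
    Xr = X * resid[:, None]
    J = Xr.T @ Xr / sigma2**2
    return -2 * loglik + 2 * np.trace(np.linalg.solve(H, J))

rng = np.random.default_rng(3)
n = 500
X = rng.standard_normal((n, 5))
y = X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.standard_normal(n)  # nonlinear truth

# Score nested candidate models; smaller is better
scores = {d: gaic(X[:, :d], y) for d in range(1, 6)}
print(scores)
```

Under misspecification the trace term exceeds $d$ wherever the residuals are heteroskedastic, which is what discourages over-selection relative to plain AIC.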
3. Sufficient Conditions for Asymptotic Validity
Inference and selection procedures in high-dimensional misspecified linear models require regularity beyond classical settings. Typical assumptions include:
- Design matrix: minimal and maximal eigenvalues of $\Sigma$ are bounded below and above; uniform boundedness of covariates and nodewise regression residuals.
- Sparsity: both the pseudo-true parameter $\beta^*$ and the nodewise regression coefficients must be sufficiently sparse, typically $s_0 = o(\sqrt{n}/\log p)$.
- Error moments: (conditional) sub-Gaussianity or finite higher-order moments for the errors.
- Compatibility/restricted eigenvalue: ensuring regularized estimators are well behaved.
- Nondegeneracy: residual variance of score corrections must be bounded away from zero.
These conditions ensure that rates of convergence match those under well-specified models and that robust procedures (debiased Lasso, DEF inference) deliver reliable uncertainty quantification (Bühlmann et al., 2015, Shah et al., 2019).
4. Interpretation and Scope of Inference
Under misspecification, population regression coefficients lack a straightforward causal or structural interpretation. The correct target is always the $L_2$-projection of the (potentially nonlinear) regression function $f$ onto the linear span of the covariates:
$$\beta^* = \operatorname*{arg\,min}_{\beta} \mathbb{E}\big[(f(x_i) - x_i^\top \beta)^2\big].$$
Hypothesis tests and confidence intervals thus pertain to these "pseudo-true" parameters. For Gaussian predictors, the non-zeros of $\beta^*$ often correspond to the active predictors in the support of $f$. In fixed-design scenarios, the confidence intervals obtained are uniform across all sufficiently sparse approximations of $f(X)$ by $X\beta$ (Bühlmann et al., 2015).
For model selection, sandwich-based generalized information criteria ensure that procedures select sparse models approximating the truth in Kullback–Leibler risk, not necessarily the "true" support.
5. Extensions and Practical Recommendations
- Measurement error: Extensions of DEF methodologies handle measurement error under double-robustness, ensuring that inference for a component of $\beta^*$ is valid as long as either the outcome or exposure model is correctly specified and sparse (Cui et al., 2024).
- Prediction inference with general/dense loadings: Wald-type inference for predictions $x_{\mathrm{new}}^\top \beta^*$ (with a dense loading $x_{\mathrm{new}}$) is possible in high-dimensional misspecified models, provided sparsity of the precision matrix or inverse Hessian rather than of $\beta^*$ itself (Liang et al., 15 Jul 2025).
- Semi-parametric and double-robust inference: For missing data or potential outcome models, double-robust -estimation and desparsified inference are valid for projection parameters as long as one of the nuisance models (propensity or outcome regression) is correctly specified and sparse (Chakrabortty et al., 2019, Dukes et al., 2019).
- Validation and software: Implementations such as the “hdi” R package provide both classical and robust sandwich variance estimation for de-sparsified Lasso (Bühlmann et al., 2015).
Recommended practice is to state explicitly whether $X$ is treated as random (unconditional inference) or fixed (conditional), check compatibility conditions via pilot fits, and employ sandwich variance estimation whenever model misspecification, heteroskedasticity, or nonlinearity is suspected.
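The payoff of the sandwich recommendation can be seen in a few lines (a minimal sketch with a hypothetical heteroskedastic example; HC0-style robust errors for plain OLS, not the full high-dimensional machinery):

```python
import numpy as np

def ols_sandwich_se(X, y):
    # Classical vs heteroskedasticity-robust (HC0 sandwich) standard errors
    n, d = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - d)
    se_classical = np.sqrt(np.diag(XtX_inv) * sigma2)
    meat = (X * resid[:, None]).T @ (X * resid[:, None])   # sum x_i x_i' r_i^2
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_robust

rng = np.random.default_rng(4)
n = 2000
X = rng.standard_normal((n, 2))
# Nonlinear truth: residuals around the linear projection are heteroskedastic
y = X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.standard_normal(n)
beta, se_c, se_r = ols_sandwich_se(X, y)
print(beta, se_c, se_r)  # robust SE for the first coefficient is inflated
```

Classical standard errors understate the uncertainty here precisely because the unmodeled quadratic term makes the residual variance depend on $x_1$; the sandwich form absorbs that.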
6. Related Developments and Limitations
- High-dimensional F-statistics: The classical $F$-test for submodel regression remains asymptotically valid even under gross misspecification of the submodel under precise high-dimensional conditions, effectively testing whether the best linear predictor is constant (Leeb et al., 2019).
- High-dimensional mixed models: REML estimation of variance components remains consistent for error variance and for total heritability in sparse/structured random-effects models, even under full-model misspecification; after suitable scaling, random-effect variance is consistently estimated (Jiang et al., 2014, Dao et al., 2021).
- Bayesian inference: Standard Bayesian posteriors can be inconsistent under high-dimensional misspecification, exhibiting non-concentration or hypercompression. The SafeBayes algorithm rectifies this by adaptively flattening the likelihood (learning rate $\eta$), restoring consistency for prediction and calibration (Grünwald et al., 2014).
7. Empirical Performance and Simulation Findings
Simulations across a range of misspecification scenarios consistently show:
- Sandwich-based, debiased, and DEF-type methods maintain nominal coverage for confidence intervals even under nonlinearity or heteroskedasticity, while naive or classical approaches can substantially undercover.
- Model-selection procedures with explicit sandwich and KL penalty (GBIC, HGBIC) avoid over-selection and spurious discoveries in ultra-high-dimensional settings, outperforming AIC/BIC when the linear model is only an approximation (Basu et al., 2014, Demirkaya et al., 2018).
- GCV-tuned ridge regression provides prediction risk close to the oracle across a broad range of misspecification, outperforming Lasso for Gaussian designs (Shinkyu, 20 Jan 2026).
- In GWAS and mixed model contexts, REML error variance estimates are consistent regardless of sparsity; genetic variance estimates may be “shrunk” by sparsity but are consistent after scaling (Jiang et al., 2014, Dao et al., 2021).
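The GCV-tuned ridge finding is easy to reproduce in miniature; the sketch below (illustrative, with an arbitrary misspecified data-generating process and a small hand-picked penalty grid) selects the ridge penalty by minimizing the standard GCV score $\|y - \hat y_\lambda\|^2 / n \big(1 - \operatorname{tr}(S_\lambda)/n\big)^2$:

```python
import numpy as np

def ridge_gcv(X, y, lambdas):
    # Select the ridge penalty by generalized cross-validation (GCV)
    n, p = X.shape
    best = None
    for lam in lambdas:
        # Smoother matrix S_lam = X (X'X + lam I)^{-1} X'
        A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
        S = X @ A
        fitted = S @ y
        gcv = np.sum((y - fitted) ** 2) / n / (1 - np.trace(S) / n) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam, A @ y)
    return best  # (gcv score, chosen lambda, ridge coefficients)

rng = np.random.default_rng(5)
n, p = 200, 100
X = rng.standard_normal((n, p))
# Dense linear signal plus a nonlinear perturbation: misspecified working model
y = X @ rng.normal(0, 0.3, p) + 0.3 * np.tanh(X[:, 0]) + rng.standard_normal(n)
score, lam, beta = ridge_gcv(X, y, lambdas=[0.1, 1.0, 10.0, 100.0])
print(f"GCV-selected lambda: {lam}")
```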
In summary, high-dimensional misspecified linear models constitute a well-defined regime where robust inference, selection, and prediction are achievable by targeting projection parameters and employing sandwich correction, double-robust or de-sparsified procedures. These methods accommodate both the high-dimensionality and the model uncertainty that inevitably arise in complex applications, with validated theoretical guarantees and practical implementations (Bühlmann et al., 2015, Shah et al., 2019, Basu et al., 2014, Liang et al., 15 Jul 2025, Chakrabortty et al., 2019, Cui et al., 2024).