
High-Dimensional Misspecified Linear Models

Updated 27 January 2026
  • High-dimensional misspecified linear models are defined as linear approximations in settings where p exceeds n, targeting the pseudo-true projection for robust estimation.
  • They employ methods like de-sparsified Lasso and double-estimation-friendly (DEF) inference to achieve valid confidence intervals and hypothesis tests despite misspecification.
  • Advanced model selection techniques using generalized AIC/BIC with sandwich corrections enhance sparse model recovery and predictive performance under nonlinearity and heteroskedasticity.

High-dimensional misspecified linear models refer to statistical frameworks in which the observed data are modeled with a linear structure in high-dimensional settings ($p \gg n$), but the true underlying data-generating mechanism is not necessarily linear. Despite the model misspecification, much recent research has focused on valid inference, robust estimation, variable/model selection, and uncertainty quantification in such regimes. The following entry synthesizes key theoretical foundations, methodology, interpretational issues, and recent developments.

1. Definition and Formal Structure

Let $\{(Y^{(i)}, X^{(i)})\}_{i=1}^n$, with $Y^{(i)} \in \mathbb{R}$ and $X^{(i)} \in \mathbb{R}^p$, be the observed data. The working model is

$$Y = X\beta + \varepsilon, \qquad \beta \in \mathbb{R}^p.$$

In the well-specified case, there exists $\beta^*$ such that $Y^{(i)} = (X^{(i)})^T\beta^* + \varepsilon^{(i)}$ with $E[\varepsilon^{(i)} \mid X^{(i)}] = 0$ and $\mathrm{Var}(\varepsilon^{(i)} \mid X^{(i)}) = \sigma^2$. In the misspecified case, the true data-generating process is $Y^{(i)} = f^0(X^{(i)}) + \xi^{(i)}$, where $f^0$ is arbitrary and $E[\xi^{(i)} \mid X^{(i)}] = 0$ (random design). The parameter of inferential interest then becomes the pseudo-true projection

$$\beta^0 = \arg\min_{\beta \in \mathbb{R}^p} E\bigl[f^0(X^{(0)}) - (X^{(0)})^T\beta\bigr]^2,$$

which is the $L_2$-best linear approximation of $f^0$ (Bühlmann et al., 2015). Under fixed design and full column-rank $X$, the (possibly sparse) representation is $\beta^0 = \arg\min_{\beta}\{\|\beta\|_1 : X\beta = f^0(X)\}$.

In practice, high dimensionality is quantified via $p \gg n$, and the framework encompasses both random and fixed design regimes.
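Under a Gaussian design with identity covariance, the projection target has the closed form $\beta^0 = \Sigma^{-1} E[X f^0(X)]$, which can be checked numerically. Below is a minimal Monte Carlo sketch; the particular nonlinear $f^0$ and the identity-covariance design are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 5  # very large n so the Monte Carlo moments are accurate

# Gaussian design with identity covariance (an illustrative assumption)
X = rng.standard_normal((n, p))

# Hypothetical nonlinear regression function: only coordinates 0 and 1 are active
f0 = np.sin(X[:, 0]) + X[:, 1] ** 3

# Pseudo-true projection beta0 = Sigma^{-1} E[X f0(X)], via sample moments
Sigma_hat = X.T @ X / n
beta0 = np.linalg.solve(Sigma_hat, X.T @ f0 / n)

# Stein's identity gives the exact targets: E[X_1 sin(X_1)] = e^{-1/2} ~ 0.61
# and E[X_2 * X_2^3] = E[X_2^4] = 3; inactive coordinates project to zero.
print(np.round(beta0, 2))
```

Even though $f^0$ is nonlinear, the pseudo-true $\beta^0$ is sparse here with the same support as $f^0$, illustrating why inference on the projection remains meaningful.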

2. Inference Methodology and Robust Estimation

Classical inference procedures often yield invalid results under misspecification, especially in high dimensions and for standard debiased or regularized estimators. Recent advances address these limitations by developing procedures that target meaningful population quantities (often the projection parameter β0\beta^0) and employ robust techniques.

De-sparsified Lasso

The de-sparsified (or debiased) Lasso is a two-step procedure:

  1. Compute the initial Lasso estimator

$$\hat\beta = \arg\min_{\beta} \Bigl\{ \|Y - X\beta\|_2^2 / n + \lambda \|\beta\|_1 \Bigr\}$$

  2. For each $j = 1, \ldots, p$, estimate the approximate inverse of the design covariance via a nodewise Lasso regression to obtain the residual direction $Z_j$.
  3. Form the estimator:

$$\hat b_j = \hat\beta_j + \frac{1}{n} Z_j^T (Y - X\hat\beta).$$

Under design and sparsity regularity (see assumptions below), the normalized estimator is asymptotically normal:

$$\sqrt{n}\, \frac{Z_j^T X_j / n}{\omega_{p;jj}} \bigl(\hat b_j - \beta^0_j\bigr) \xrightarrow{d} \mathcal{N}(0,1), \qquad \omega_{p;jj}^2 = E\bigl[(Z_j^{(0)} \varepsilon^{(0)})^2\bigr].$$

The variance $\omega_{p;jj}^2$ is consistently estimated using “sandwich” formulas robust to heteroskedasticity and nonlinearity (Bühlmann et al., 2015). Confidence intervals based on this estimator maintain nominal level regardless of whether the model is well-specified.
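The steps above can be sketched for a single coordinate $j$ as follows. The tuning parameters, the data-generating function, and the use of scikit-learn's `Lasso` are illustrative choices, not the cited papers' exact implementation; the debiased estimate is written with the conventional $Z_j^T X_j$ normalization:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, j = 400, 50, 0  # coordinate j is the target of inference

X = rng.standard_normal((n, p))
# Nonlinear truth: the working linear model is misspecified, but the
# pseudo-true projection coefficient for coordinate 0 equals 1.
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.standard_normal(n)

# Step 1: initial Lasso fit (lambda chosen ad hoc for this sketch)
lam = 0.1
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Step 2: nodewise Lasso of X_j on the other columns -> residual direction Z_j
others = np.delete(np.arange(p), j)
gamma_hat = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, j]).coef_
Z_j = X[:, j] - X[:, others] @ gamma_hat

# Step 3: debiased estimate and heteroskedasticity-robust (sandwich) variance
resid = y - X @ beta_hat
b_j = beta_hat[j] + Z_j @ resid / (Z_j @ X[:, j])
omega2 = np.mean((Z_j * resid) ** 2)                 # sandwich estimate of omega^2
se = np.sqrt(omega2) / (Z_j @ X[:, j] / n) / np.sqrt(n)

print(f"debiased estimate {b_j:.3f}, 95% CI half-width {1.96 * se:.3f}")
```

The sandwich variance `omega2` uses the empirical second moment of $Z_j \hat\varepsilon$ rather than a pooled residual variance, which is what keeps the interval valid under nonlinearity and heteroskedasticity.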

Double-Estimation-Friendly (DEF) Inference

The DEF property refers to test statistics whose null distribution is robust to misspecification as long as either the model for $Y \mid X, Z$ or the model for $X \mid Z$ is correct. In high-dimensional settings, Shah and Bühlmann (Shah et al., 2019) propose DEF methods for hypothesis testing (e.g., conditional independence $X \perp Y \mid Z$) and construction of confidence intervals for regression parameters. Their high-dimensional DEF test uses regularized residuals from regressions of both $Y$ on $Z$ and $X$ on $Z$, constructed via square-root Lasso, with a regularized partial-correlation statistic:

$$T_{\mathrm{DEF}} = \sqrt{n}\, \frac{(Y - Z\hat\beta^Y)^T (X - Z\hat\beta^X)}{\|Y - Z\hat\beta^Y\|_2 \, \|X - Z\hat\beta^X\|_2}.$$

Type I error is controlled under $H_0$ if either nuisance model is correctly specified and sparse. Coverage of the resulting confidence intervals is robust to nonlinearity and heteroskedasticity.
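A minimal simulation of the DEF statistic under the null $X \perp Y \mid Z$, with the model for $Y \mid Z$ deliberately misspecified (nonlinear) but $X \mid Z$ linear and sparse. An ordinary Lasso stands in for the square-root Lasso, and all tuning choices are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, q = 500, 30

Z = rng.standard_normal((n, q))
# Under H0: X and Y are conditionally independent given Z (both depend only on Z_1)
X = Z[:, 0] + rng.standard_normal(n)
Y = Z[:, 0] ** 2 + rng.standard_normal(n)  # Y|Z is nonlinear: that model is misspecified

lam = 0.1
rY = Y - Z @ Lasso(alpha=lam, fit_intercept=False).fit(Z, Y).coef_
rX = X - Z @ Lasso(alpha=lam, fit_intercept=False).fit(Z, X).coef_

# Regularized partial-correlation statistic; |T| > 1.96 rejects at the 5% level
T = np.sqrt(n) * (rY @ rX) / (np.linalg.norm(rY) * np.linalg.norm(rX))
print(f"T_DEF = {T:.2f}")
```

Because the $X \mid Z$ regression is correctly specified and sparse, the statistic stays approximately standard normal under the null even though the $Y \mid Z$ working model is wrong, which is exactly the DEF guarantee.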

Model Selection under Misspecification

Conventional information criteria such as AIC or BIC are not robust under high-dimensional misspecification. Two lines of recent work (Lv & Liu (Basu et al., 2014); Lv et al. (Demirkaya et al., 2018)) obtain high-dimensional analogs of these criteria with explicit Kullback–Leibler risk expansions:

  • Generalized AIC (GAIC) and Generalized BIC (GBIC) replace the classical penalty by terms involving the trace and log-determinant of the “covariance contrast” matrix $H_n^{-1} K_n$, reflecting the sandwich variance inflation from misspecification.
  • Large-scale penalization (GBIC$_p$/HGBIC$_p$) further includes a $2d \log p$ term for model size $d$.

Uniform consistency and model-selection consistency are established under mild high-dimensional assumptions, and explicit plug-in estimators of HnH_n and KnK_n are used in practical computation.

Table: Information Criteria for Model Selection in High-Dimensional Misspecified Models

| Criterion | Penalty Structure | Misspecification Correction |
|---|---|---|
| Classic AIC/BIC | $2d$, $d \log n$ | None |
| GAIC/GBIC | $2\,\mathrm{tr}(H_n^{-1} K_n)$, $d \log n - \log\lvert H_n^{-1} K_n \rvert$ | Yes (sandwich, KL) |
| GBIC$_p$/HGBIC$_p$ | $2d \log p$ (GBIC$_p$), prior-based log-size term | Yes (plus KL/prior and sandwich) |

The sandwich correction is essential to avoid over-selection and poor predictive risk under nonlinearity or non-normality (Basu et al., 2014, Demirkaya et al., 2018).
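The sandwich penalty $\mathrm{tr}(H_n^{-1} K_n)$ can be made concrete for a Gaussian quasi-likelihood least-squares fit, where it reduces to approximately $d$ under a correct homoskedastic model and inflates under heteroskedasticity. The following is a sketch under those assumptions, not the exact plug-in estimators of the cited papers:

```python
import numpy as np

def sandwich_trace(X, y):
    """tr(H^{-1} K) for a least-squares fit under Gaussian quasi-likelihood:
    H = X'X/(n sigma^2), K = (1/n) sum_i e_i^2 x_i x_i' / sigma^4."""
    n, _ = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    H = X.T @ X / n
    K = (X * resid[:, None] ** 2).T @ X / n
    return np.trace(np.linalg.solve(H, K)) / sigma2

def gaic(X, y):
    """GAIC-style criterion: deviance plus 2 tr(H^{-1} K); collapses to AIC's
    2d penalty when the model is correct and homoskedastic."""
    n, _ = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = resid @ resid / n
    return n * np.log(sigma2) + 2 * sandwich_trace(X, y)

rng = np.random.default_rng(3)
n = 300
X = rng.standard_normal((n, 10))
# Heteroskedastic, misspecified truth: only the first two columns carry signal
y = X[:, 0] + X[:, 1] + (1 + np.abs(X[:, 0])) * rng.standard_normal(n)

print("GAIC, 2-variable submodel:", round(gaic(X[:, :2], y), 1))
print("GAIC, full 10-variable model:", round(gaic(X, y), 1))
```

Under homoskedastic, correctly specified data the sandwich trace is close to the model dimension $d$; heteroskedasticity inflates it, penalizing large models more heavily, which is the over-selection protection the table describes.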

3. Sufficient Conditions for Asymptotic Validity

Inference and selection procedures in high-dimensional misspecified linear models require regularity beyond classical settings. Typical assumptions include:

  • Design matrix: minimal and maximal eigenvalues of $\Sigma = \mathrm{Cov}(X^{(0)})$ are bounded away from zero and infinity; uniform boundedness of covariates and nodewise regression residuals.
  • Sparsity: both the pseudo-true parameter $\beta^0$ and the nodewise regression coefficients must be sufficiently sparse, typically $o(\sqrt{n}/\log p)$.
  • Error moments: (conditional) sub-Gaussianity or finite $(2+\delta)$-th moments for the errors.
  • Compatibility/restricted eigenvalue: ensures that $\ell_1$-regularized estimators are well behaved.
  • Nondegeneracy: the residual variance of the score corrections must be bounded away from zero.

These conditions ensure that rates of convergence match those under well-specified models and that robust procedures (debiased Lasso, DEF inference) deliver reliable uncertainty quantification (Bühlmann et al., 2015, Shah et al., 2019).

4. Interpretation and Scope of Inference

Under misspecification, population regression coefficients lack a straightforward causal or structural interpretation. The correct target is always the $L_2$-projection of the (potentially nonlinear) regression function onto the span of $X$:

$$\beta^0 = \arg\min_\beta E\bigl[f^0(X) - X^T\beta\bigr]^2.$$

Hypothesis tests and confidence intervals thus pertain to these “pseudo-true” parameters. For Gaussian predictors, the non-zeros of $\beta^0$ often correspond to the active predictors in the support of $f^0$. In fixed-design scenarios, the confidence intervals obtained are uniform across all sufficiently sparse approximations of $f^0$ by $X$ (Bühlmann et al., 2015).

For model selection, sandwich-based generalized information criteria ensure that procedures select sparse models approximating $f^0$ in $L_2$-risk, not necessarily the “true” support.

5. Extensions and Practical Recommendations

  • Measurement error: Extensions of DEF methodology handle measurement error under double robustness, ensuring that inference for a component of $\beta$ is valid as long as either the outcome or the exposure model is correctly specified and sparse (Cui et al., 2024).
  • Prediction inference with general/dense loadings: Wald-type inference for predictions $a^T\hat\beta$ (with dense $a$) is possible in high-dimensional misspecified models, provided sparsity of the precision matrix or inverse Hessian instead of $\beta$ itself (Liang et al., 15 Jul 2025).
  • Semi-parametric and double-robust inference: For missing-data or potential-outcome models, double-robust $M$-estimation and desparsified inference are valid for projection parameters as long as one of the nuisance models (propensity or outcome regression) is correctly specified and sparse (Chakrabortty et al., 2019, Dukes et al., 2019).
  • Validation and software: Implementations such as the “hdi” R package provide both classical and robust sandwich variance estimation for the de-sparsified Lasso (Bühlmann et al., 2015).

Recommended practice is to explicitly treat $X$ as random (unconditional inference) or fixed (conditional inference), check compatibility via pilot fits, and employ sandwich variance estimation whenever model misspecification, heteroskedasticity, or nonlinearity is suspected.

6. Related Testing, Mixed-Model, and Bayesian Results

  • High-dimensional $F$-statistics: The classical $F$-test for submodel regression remains asymptotically valid even under gross misspecification of the submodel under precise high-dimensional conditions, effectively testing whether the best linear predictor is constant (Leeb et al., 2019).
  • High-dimensional mixed models: REML estimation of variance components remains consistent for the error variance and for total heritability in sparse/structured random-effects models, even under full-model misspecification; after suitable scaling, the random-effect variance is consistently estimated (Jiang et al., 2014, Dao et al., 2021).
  • Bayesian inference: Standard Bayesian posteriors can be inconsistent under high-dimensional misspecification, exhibiting non-concentration or hypercompression. The SafeBayes algorithm rectifies this by adaptively flattening the likelihood (learning rate $\eta < 1$), restoring consistency for prediction and calibration (Grünwald et al., 2014).

7. Empirical Performance and Simulation Findings

Simulations across a range of misspecification scenarios consistently show:

  • Sandwich-based, debiased, and DEF-type methods maintain nominal coverage for confidence intervals even under nonlinearity or heteroskedasticity, while naive or classical approaches can substantially undercover.
  • Model-selection procedures with explicit sandwich and KL penalties (GBIC$_p$, HGBIC$_p$) avoid over-selection and spurious discoveries in ultra-high-dimensional settings, outperforming AIC/BIC when the linear model is only an approximation (Basu et al., 2014, Demirkaya et al., 2018).
  • GCV-tuned ridge regression provides prediction risk close to the oracle across a broad range of misspecification, outperforming Lasso for Gaussian designs (Shinkyu, 20 Jan 2026).
  • In GWAS and mixed model contexts, REML error variance estimates are consistent regardless of sparsity; genetic variance estimates may be “shrunk” by sparsity but are consistent after scaling (Jiang et al., 2014, Dao et al., 2021).
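The ridge-versus-Lasso comparison under a dense, mildly nonlinear truth can be reproduced in miniature. The design, signal, and tuning grid below are illustrative assumptions; scikit-learn's `RidgeCV` uses efficient leave-one-out tuning, closely related to GCV:

```python
import numpy as np
from sklearn.linear_model import Lasso, RidgeCV

rng = np.random.default_rng(4)
n, p = 200, 300  # p > n

# Dense signal passed through a mild nonlinearity: unfavorable for the Lasso
theta = rng.standard_normal(p) / np.sqrt(p)

def draw(m):
    X = rng.standard_normal((m, p))
    return X, np.tanh(X @ theta) + 0.5 * rng.standard_normal(m)

X_tr, y_tr = draw(n)
X_te, y_te = draw(2000)

ridge = RidgeCV(alphas=np.logspace(-2, 4, 30)).fit(X_tr, y_tr)  # LOO/GCV-style tuning
lasso = Lasso(alpha=0.05, max_iter=10_000).fit(X_tr, y_tr)      # ad hoc penalty

mse_ridge = np.mean((ridge.predict(X_te) - y_te) ** 2)
mse_lasso = np.mean((lasso.predict(X_te) - y_te) ** 2)
print(f"ridge test MSE: {mse_ridge:.3f}, lasso test MSE: {mse_lasso:.3f}")
```

With a dense pseudo-true projection, soft-thresholding discards most of the many small coefficients, while tuned ridge retains them in shrunken form; this is the mechanism behind the oracle-like ridge risk reported above.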

In summary, high-dimensional misspecified linear models constitute a well-defined regime where robust inference, selection, and prediction are achievable by targeting projection parameters and employing sandwich correction, double-robust or de-sparsified procedures. These methods accommodate both the high-dimensionality and the model uncertainty that inevitably arise in complex applications, with validated theoretical guarantees and practical implementations (Bühlmann et al., 2015, Shah et al., 2019, Basu et al., 2014, Liang et al., 15 Jul 2025, Chakrabortty et al., 2019, Cui et al., 2024).
