
Linearized Neural Network Approximation

Updated 2 February 2026
  • Linearized approximation is defined as replacing a neural network’s nonlinear behavior with its first-order Taylor expansion around a reference parameter.
  • The approach underpins analyses like the neural tangent kernel regime and random feature models, enabling scalable learning and efficient uncertainty quantification.
  • Practical algorithms based on linearization reduce computational cost in tasks such as robustness verification and PDE solving, while exposing limits in feature learning.

A linearized approximation for neural networks refers to replacing the nonlinear predictive model induced by a neural network with its first-order Taylor expansion around a reference parameter set, typically at initialization or after gradient-based pretraining. This construction underlies a broad class of modern theoretical analyses, including the neural tangent kernel (NTK) regime, random features models, linearized Laplace approximations for Bayesian prediction, and practical methods in PDE solving and robustness verification. The approach is pivotal both as an analytical tool and as the basis for algorithmic surrogates enabling scalable learning and inference in high-dimensional, overparameterized settings.

1. Mathematical Foundation of Linearized Neural Network Models

Let $f_\theta(x)$ denote a neural network with parameters $\theta \in \mathbb{R}^p$. The linearized model at a reference parameter $\theta_0$ is given by the first-order Taylor expansion

$$f^{\text{lin}}_\theta(x) = f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x) \cdot (\theta - \theta_0),$$

where $\nabla_\theta f_{\theta_0}(x)$ is the Jacobian evaluated at $\theta_0$. When the parameters are initialized randomly and either held fixed (random features) or allowed to drift minimally under gradient flow (NTK regime), the function space induced by $f^{\text{lin}}_\theta(x)$ is linear in $\theta$, although the feature map $x \mapsto \nabla_\theta f_{\theta_0}(x)$ may be highly nonlinear in $x$.
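
The expansion above can be sketched numerically. The two-layer tanh model below is an illustrative toy (hypothetical architecture and sizes, not taken from the cited works); it checks that the linearized model agrees with the network to first order near $\theta_0$:

```python
import numpy as np

# Toy model (illustrative): f_theta(x) = v . tanh(W x), parameters theta = (W, v).
rng = np.random.default_rng(0)
d, m = 3, 50
W0 = rng.normal(size=(m, d)) / np.sqrt(d)
v0 = rng.normal(size=m) / np.sqrt(m)

def f(W, v, x):
    return v @ np.tanh(W @ x)

def jacobian(W, v, x):
    """Gradient of f with respect to all parameters (W, v), flattened."""
    h = np.tanh(W @ x)                     # hidden activations
    dW = np.outer(v * (1 - h**2), x)       # df/dW, shape (m, d)
    dv = h                                 # df/dv
    return np.concatenate([dW.ravel(), dv])

def f_lin(W, v, x):
    """First-order Taylor expansion of f around the reference (W0, v0)."""
    dtheta = np.concatenate([(W - W0).ravel(), v - v0])
    return f(W0, v0, x) + jacobian(W0, v0, x) @ dtheta

x = rng.normal(size=d)
# Small parameter perturbation: the linearized model tracks the network
# up to an O(||dtheta||^2) remainder.
Wp = W0 + 1e-4 * rng.normal(size=(m, d))
vp = v0 + 1e-4 * rng.normal(size=m)
print(abs(f(Wp, vp, x) - f_lin(Wp, vp, x)))
```

At the reference point itself the two models coincide exactly; the gap grows quadratically in the perturbation size, which is what makes the approximation accurate in the minimal-drift (NTK) regime.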

This linearization is central in high-dimensional asymptotic analysis, such as in the NTK theory and its generalizations (Ghorbani et al., 2019), and forms the basis of kernel surrogates and linear Gaussian-process approximations (Maddox et al., 2021, Ortega et al., 2023).

2. Regimes and Expressivity: Random Features, NTK, and Approximation Limits

Linearized approximations manifest concretely in two-layer networks as either random features (RF) models (fix the first layer, fit only the second) or NTK models (expand around initialization, fit the tangent dynamics). In high-dimensional analysis on the sphere, for a target function $f_\star$ and $n$ samples in $d$ dimensions, expressivity is characterized via polynomial degrees:

  • Approximation-limited regime ($n=\infty$, $N$ finite, $d$ large): For random features, the model fits degree-$\ell$ harmonics if $d^{\ell+\delta} \leq N \leq d^{\ell+1-\delta}$; for NTK, degree-$(\ell+1)$ (Ghorbani et al., 2019). The polynomial threshold is dictated by the block-diagonal structure of the Gram matrices in the spherical harmonic basis.
  • Sample-limited regime ($N=\infty$, $n$ finite, $d$ large): Kernel ridge regression methods generalize only up to degree-$\ell$ if $d^{\ell+\delta} \leq n \leq d^{\ell+1-\delta}$.
  • Saturation theorem: For ReLU$^k$ networks on the sphere with smoothness $r > (d+2k+1)/2$, linearized approximation cannot converge faster than $n^{-(d+2k+1)/(2d)}$, matching the upper bounds and establishing tight expressivity limits (Mao et al., 5 Oct 2025).
  • Integral representation in Sobolev spaces: Any $f \in H^{(d+2k+1)/2}$ admits an $L^2$-weighted integral over ridge functions, and linearized networks (fixed hidden layer, fitted outer layer) achieve the optimal rate $O(n^{-1/2-(2k+1)/(2d)})$ (Liu et al., 1 May 2025).

This analysis demonstrates that linearized two-layer networks realize random polynomial feature expansions, with expressivity growing in quantized steps as width or sample size crosses powers of $d$. Nonlinear training of all parameters breaks these polynomial-basis bottlenecks, accessing strictly greater function classes (Ghorbani et al., 2019).

3. Origin and Validity of Linearization: Weak-Correlation Principle

The apparent linearity of parameter dynamics in wide networks is induced by weak correlations between the first and higher-order derivatives of $f(\theta;x)$, evaluated at initialization. For gradient-based training with a small learning rate $\eta \sim O(1/n)$, Taylor expansion around $\theta_0$ yields that higher-order terms vanish as depth and width increase, provided the correlation tensors $\mathcal{C}^{D,d}$ between $\partial^{D+d}f$ and multiple copies of $\partial f$ decay as $O(1/m(n))$ or $O(1/\sqrt{m(n)})$ (Shem-Ur et al., 2024).

Under this weak-correlation assumption, the entire training trajectory is well-approximated by first-order (NTK) dynamics, and deviations from linear dynamics are suppressed as $O(1/m(n))$. At finite width, corrections scale as $O(1/\sqrt{n})$, but as width increases, the regime of validity for the linearized approximation extends to longer training times $T$, provided $T \cdot O(1/\sqrt{n}) \ll 1$ (Shem-Ur et al., 2024).

4. Linearized Models in Bayesian Prediction and Uncertainty Quantification

The linearized Laplace approximation (LLA) for Bayesian neural networks places a Gaussian posterior around a MAP-pretrained parameter, with covariance given by the generalized Gauss-Newton (GGN) matrix. In function space, this corresponds to a Gaussian process with kernel $K_{\mathrm{LLA}}(x,x') = J(x)\,\Sigma\,J(x')^\top$, formally matching a neural tangent kernel at the trained weights (Deng et al., 2022, Ortega et al., 2023).
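
This function-space view can be sketched on a toy model (hypothetical architecture and hyperparameters; Gaussian regression likelihood, so the GGN is $J^\top J/\sigma^2$):

```python
import numpy as np

# Toy LLA sketch: tiny tanh network f(x) = v . tanh(W x), Gaussian likelihood.
# Posterior covariance Sigma = (GGN + alpha I)^{-1}; the function-space kernel
# is K(x, x') = J(x) Sigma J(x')^T, and its diagonal gives predictive variance.
rng = np.random.default_rng(2)
d, m, n = 2, 20, 30
W = rng.normal(size=(m, d))
v = rng.normal(size=m) / np.sqrt(m)

def jac(x):
    """Jacobian of f at x with respect to all parameters, flattened."""
    h = np.tanh(W @ x)
    return np.concatenate([np.outer(v * (1 - h**2), x).ravel(), h])

X = rng.normal(size=(n, d))
J = np.stack([jac(x) for x in X])          # (n, p) Jacobian over training inputs
alpha, sigma2 = 1.0, 0.1                   # prior precision, noise variance

# GGN-based Gaussian posterior over the parameters (regression case).
Sigma = np.linalg.inv(J.T @ J / sigma2 + alpha * np.eye(J.shape[1]))

x_star = rng.normal(size=d)
var_star = jac(x_star) @ Sigma @ jac(x_star)   # predictive variance at x_star
print(var_star)
```

Observing data can only shrink the covariance, so the predictive variance never exceeds the prior variance $J(x_\star)J(x_\star)^\top/\alpha$; the scalability work cited below is about avoiding the explicit $p \times p$ inverse used here.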

Computational bottlenecks in LLA originate from needing to form and invert large Jacobian and Gram matrices. Recent advances introduce:

  • Nyström acceleration (ELLA): Nyström approximation of the NTK reduces computational cost by using low-rank feature approximations constructed via Jacobian-vector products and spectral decompositions. This enables scalable Bayesian inference and uncertainty estimation, even for large-scale vision transformers, retaining competitive likelihood and calibration properties (Deng et al., 2022).
  • Surrogate neural kernels (ScaLLA): A compact network $h_\phi$ is trained to match the NTK in inner-product space via random projections, yielding scalable LLA inference with improved calibration and out-of-distribution detection (Ortega et al., 29 Jan 2026).
  • Variational sparse GPs (VaLLA): RKHS-dual variational inference yields exact LLA predictive means with per-step complexity independent of $N$. The predictive mean remains that of the original network, and posterior covariance is matched through optimized inducing points (Ortega et al., 2023).

These advances enable exact-form Bayesian posteriors and fast uncertainty quantification for pre-trained DNNs without forming the full NTK or GGN.
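
The Nyström idea behind ELLA-style acceleration can be illustrated on a generic kernel matrix (an RBF Gram matrix stands in for the NTK here; landmark count and lengthscale are assumptions of the sketch):

```python
import numpy as np

# Nystrom low-rank kernel approximation: approximate a large Gram matrix K by
# K[:, S] K[S, S]^+ K[S, :] built from a small landmark subset S, avoiding any
# computation that touches all n^2 entries at once in a real pipeline.
rng = np.random.default_rng(3)
n, k = 200, 40
X = rng.normal(size=(n, 5))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 20.0)                       # smooth RBF kernel, fast eigendecay

S = rng.choice(n, size=k, replace=False)     # landmark points
K_hat = K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S, :]

rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(rel_err)
```

The approximation has rank at most `k`, so downstream solves cost $O(nk^2)$ instead of $O(n^3)$; accuracy hinges on the kernel's eigenvalue decay, which is what the spectral constructions in ELLA exploit.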

5. Practical Algorithms Utilizing Linearized Approximations

  • Linearized Subspace Refinement (LSR): Given a trained network, a first-order (linearized) residual model is constructed at the trained weights, with parameter corrections sought in a low-dimensional Jacobian-induced subspace. Solving the direct least-squares problem in this subspace typically yields order-of-magnitude test-error reductions compared to the SGD baseline, owing to improved numerical conditioning. Iterative LSR alternates linear correction with nonlinear retraining for composite loss settings (Cao et al., 20 Jan 2026).
  • Linearized shallow models for PDEs: In high-dimensional PDE solving, linearized shallow networks (random features or fixed deterministic features) reduce the learning problem to linear least squares. Collocation-based least-squares methods are preferred to variational (Galerkin) approaches due to better conditioning. High accuracy for ReLU$^k$ or tanh features does not require randomization; deterministic quasi-uniform grids suffice and achieve optimal approximation rates (Mao et al., 16 Jan 2026, Liu et al., 1 May 2025).
  • Fast adaptation and transfer via linearized GPs: By treating the linearized network as a GP (with kernel given by the Jacobian), domain adaptation and uncertainty quantification reduce to GP posterior inference. Implicit Jacobian-vector products and scalable inference via conjugate gradients and Fisher-vector products allow analytic, convex, and scalable transfer without weight retraining on the target domain (Maddox et al., 2021).
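
The collocation-based least-squares approach for PDEs can be sketched on a 1-D Poisson problem (a manufactured example with assumed feature distribution, not drawn from the cited papers):

```python
import numpy as np

# Solve -u'' = f on [0,1] with u(0) = u(1) = 0 using a linearized shallow model
# u(x) = sum_j a_j tanh(w_j x + b_j).  The features are fixed, so collocation
# turns the PDE into a linear least-squares problem in the outer weights a.
rng = np.random.default_rng(4)
N = 100                                       # number of fixed features
w = rng.uniform(1.0, 10.0, size=N)
b = rng.uniform(-10.0, 10.0, size=N)

def features(x):
    # phi_j(x) = tanh(w_j x + b_j) for an array of points x
    return np.tanh(np.outer(x, w) + b)

def features_xx(x):
    # second derivative: phi_j''(x) = -2 w_j^2 tanh(z)(1 - tanh(z)^2)
    t = np.tanh(np.outer(x, w) + b)
    return -2.0 * (w**2) * t * (1.0 - t**2)

xs = np.linspace(0.0, 1.0, 200)               # quasi-uniform collocation grid
f_rhs = np.pi**2 * np.sin(np.pi * xs)         # so the true solution is sin(pi x)

# Stack the PDE rows (-u'' = f at collocation points) with the boundary rows.
A = np.vstack([-features_xx(xs), features(np.array([0.0, 1.0]))])
rhs = np.concatenate([f_rhs, [0.0, 0.0]])
a, *_ = np.linalg.lstsq(A, rhs, rcond=None)

err = np.max(np.abs(features(xs) @ a - np.sin(np.pi * xs)))
print(err)
```

No nonlinear optimization is involved: once the features are fixed, the entire solver is one rectangular least-squares problem, which is the conditioning advantage the text attributes to collocation over Galerkin formulations.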

6. Limitations, Higher-Order Approximations, and Extensions

While linearized approximations offer compelling theoretical and practical utility, several fundamental limitations exist:

  • They cannot capture non-polynomial features or exploit first-layer weight movement ("feature learning"), which is essential for expressive tasks such as learning single-neuron activations unless the width is exponential in dimension (Ghorbani et al., 2019).
  • The "saturation" effects in expressivity are fundamental: with ReLU$^k$ activations, linearized models cannot achieve error decaying faster than $O(n^{-(d+2k+1)/(2d)})$ regardless of target smoothness, limiting their advantage even over classical finite element methods (Mao et al., 5 Oct 2025).
  • The main source of non-linearity in practice, including kernel evolution and the resulting alignment with training labels, is suppressed in the NTK/linearized regime and only realized at finite width or under non-infinitesimal training (Ortiz-Jiménez et al., 2021).
  • Quadratic and higher-order Taylorized models close the gap between NTK-based theory and full network training: the $k$-th order Taylorized model yields approximation error decreasing exponentially in $k$ at wide but finite width (with cost scaling quickly in $k$ due to nested Jacobian-vector products) (Bai et al., 2020).
  • Systematic randomization can be used to "kill" lower-order Taylor terms and focus optimization on quadratic or higher-order components, improving sample complexity and expressivity over linearized regimes with polynomial rates in dimension (Bai et al., 2019).
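
The advantage of higher-order Taylorized models over the linearized one can be checked numerically on a toy network (finite-difference derivatives; illustrative only, with the quadratic term computed by brute force):

```python
import numpy as np

# Compare first- and second-order Taylorized models of a tiny network
# f(theta) = v . tanh(W x): the k-th order model's error shrinks like
# ||dtheta||^(k+1), so the quadratic model tracks the network more closely.
rng = np.random.default_rng(5)
d, m = 2, 5
x = rng.normal(size=d)
theta0 = rng.normal(size=m * d + m)
p = theta0.size

def f(theta):
    W, v = theta[:m * d].reshape(m, d), theta[m * d:]
    return v @ np.tanh(W @ x)

def grad(theta, eps=1e-6):
    # central-difference gradient of f at theta
    return np.array([(f(theta + eps * np.eye(p)[i]) - f(theta - eps * np.eye(p)[i]))
                     / (2 * eps) for i in range(p)])

def hess(theta, eps=1e-4):
    # finite differences of the gradient, symmetrized
    H = np.array([(grad(theta + eps * np.eye(p)[i]) - grad(theta - eps * np.eye(p)[i]))
                  / (2 * eps) for i in range(p)])
    return 0.5 * (H + H.T)

g, H = grad(theta0), hess(theta0)
lin_errs, quad_errs = [], []
for _ in range(5):
    dtheta = 0.05 * rng.normal(size=p)
    f_lin = f(theta0) + g @ dtheta
    f_quad = f_lin + 0.5 * dtheta @ H @ dtheta
    lin_errs.append(abs(f(theta0 + dtheta) - f_lin))
    quad_errs.append(abs(f(theta0 + dtheta) - f_quad))
print(np.mean(lin_errs), np.mean(quad_errs))
```

In practice the quadratic term is formed via nested Jacobian-vector products rather than an explicit Hessian, which is the cost-in-$k$ caveat noted above.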

7. Linearized Approximation in Robustness Verification and Certification

Linear over-approximation of activations, especially sigmoid-like nonlinearities, underpins the propagation of output bounds through neural networks for formal adversarial robustness certification. The notion of network-wise tightest linear bounds, as opposed to heuristic neuron-wise criteria, has been formalized and shown to yield up to 251% larger certified bounds in certain regimes. Exact network-wise optimality is achievable in one-layer and non-negative-weight networks via closed-form or convex optimization over the linear surrogate (Zhang et al., 2022).

This framework determines the layer-wise propagation of linear upper and lower envelopes, aligning tightness guarantees with tractable computational pipelines for LP-based certification.
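
A simplified instance of such linear envelopes, using a standard chord/tangent construction on sigmoid's convex region (illustrative; this is not the network-wise-optimal scheme discussed above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Linear bounds for sigmoid on an interval [l, u] with u <= 0, where sigmoid is
# convex: the chord over-approximates and any tangent under-approximates.
# Certification pipelines propagate such envelopes through the network.
def sigmoid_bounds_convex(l, u):
    assert u <= 0.0, "this construction assumes the convex region of sigmoid"
    # Upper bound: chord through (l, sigmoid(l)) and (u, sigmoid(u)).
    a_up = (sigmoid(u) - sigmoid(l)) / (u - l)
    b_up = sigmoid(l) - a_up * l
    # Lower bound: tangent line at the midpoint.
    mid = 0.5 * (l + u)
    a_lo = sigmoid(mid) * (1.0 - sigmoid(mid))   # sigmoid'(mid)
    b_lo = sigmoid(mid) - a_lo * mid
    return (a_lo, b_lo), (a_up, b_up)

l, u = -3.0, -1.0
(a_lo, b_lo), (a_up, b_up) = sigmoid_bounds_convex(l, u)
zs = np.linspace(l, u, 1000)
print(np.all(a_lo * zs + b_lo <= sigmoid(zs) + 1e-9),
      np.all(sigmoid(zs) <= a_up * zs + b_up + 1e-9))
```

The concave region ($l \ge 0$) mirrors this with the roles of chord and tangent swapped; intervals straddling zero need piecewise or optimized bounds, which is where the network-wise tightness criteria of Zhang et al. (2022) come in.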


References:

(Ghorbani et al., 2019, Shem-Ur et al., 2024, Zhang et al., 2022, Cao et al., 20 Jan 2026, Bai et al., 2020, Ortega et al., 29 Jan 2026, Ortega et al., 2023, Ortiz-Jiménez et al., 2021, Mao et al., 5 Oct 2025, Deng et al., 2022, Mao et al., 16 Jan 2026, Liu et al., 1 May 2025, Maddox et al., 2021, Bai et al., 2019)
