Neural Tangent Kernel Analysis
- Neural Tangent Kernel (NTK) is a theoretical framework that linearizes gradient descent dynamics in infinite-width neural networks.
- NTK analysis utilizes explicit kernel recurrences and spectral properties to predict convergence rates, generalization errors, and feature alignment.
- Efficient computation methods such as explicit differentiation and trace estimation enable practical evaluation across diverse architectures.
The Neural Tangent Kernel (NTK) is a central concept for analyzing the optimization, generalization, and function-space learning dynamics of overparameterized neural networks in the infinite-width limit. The NTK formalism provides a kernel-theoretic framework that linearizes gradient descent dynamics and connects neural network training to kernel methods. NTK analysis encompasses explicit kernel formulas for a wide variety of architectures, fixed-point and spectral properties, extensions to finite width and practical computation, generalizations to deep and recurrent models, and the interplay of NTK structure with alignment, feature learning, and generalization performance.
1. Definition, Recursion, and Theory of the NTK
The NTK for a network $f(x;\theta)$ with parameters $\theta$ is given by
$\Theta(x, x') = \nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta)$,
i.e., the Gram matrix of parameter gradients evaluated at $x$ and $x'$ (Engel et al., 2022). In the infinite-width limit, the NTK converges (almost surely) to a deterministic kernel governed only by the network architecture and activation, independent of the initial parameter realization (Yang, 2020). For deep MLPs, explicit recursions yield the infinite-width NTK: $\Theta^{(l)}(x, x') = \Sigma^{(l)}(x, x') + \dot{\Sigma}^{(l)}(x, x')\,\Theta^{(l-1)}(x, x')$, where $\Sigma^{(l)}$ and $\dot{\Sigma}^{(l)}$ are layer-wise covariances and derivative covariances computed via Gaussian expectations over the outputs of the previous layer (Yang, 2020, Lencevicius, 2022). For ReLU activations, these are further specified by the arc-cosine kernel family and its analytic recurrences (Han et al., 2021, Zandieh et al., 2021).
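The recursion for ReLU networks can be made concrete with a short numpy sketch. This is a minimal implementation of the arc-cosine recurrences under assumptions of our choosing (bias-free layers, He-style weight variance $\sigma_w^2 = 2$), not the exact parameterization of any one cited paper:

```python
import numpy as np

def relu_ntk(X, depth=3):
    """Infinite-width NTK of a `depth`-layer bias-free ReLU MLP,
    via the arc-cosine kernel recursion (assumes sigma_w^2 = 2)."""
    sigma = X @ X.T              # Sigma^0: input Gram matrix
    theta = sigma.copy()         # Theta^0 = Sigma^0
    for _ in range(depth):
        d = np.sqrt(np.diag(sigma))
        norm = np.outer(d, d)
        cos = np.clip(sigma / norm, -1.0, 1.0)
        ang = np.arccos(cos)
        # Gaussian expectations E[relu(u)relu(v)] and E[relu'(u)relu'(v)]
        sigma = norm * (np.sin(ang) + (np.pi - ang) * cos) / np.pi
        sigma_dot = (np.pi - ang) / np.pi
        theta = sigma + sigma_dot * theta
    return theta

X = np.random.default_rng(0).normal(size=(5, 10))
K = relu_ntk(X)   # symmetric positive semi-definite 5x5 Gram matrix
```

Each loop iteration applies one step of the recursion, with the arc-cosine closed forms standing in for the Gaussian integrals that define the ReLU covariances.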
This framework extends to any architecture (convolutional, recurrent, attention-based) given by the tensor programs approach, which yields deterministic NTK limits via systematic Gaussian integration and tracks all forward and backward signal statistics (Yang, 2020). For matrix product state (MPS) tensor networks, the NTK converges to a kernel with explicit factorized structure in the infinite bond-dimension limit (Guo et al., 2021).
2. Training Dynamics and Linearization
In the NTK regime (infinite width, fixed depth), gradient descent on the parameters induces a linear evolution in function space: for mean squared error loss over the training points $X$, $\frac{d f_t(X)}{dt} = -\eta\,\Theta\,(f_t(X) - y)$ (Mysore et al., 9 Dec 2025, Chen et al., 2020, Lencevicius, 2022). The kernel remains nearly constant during the entire course of training in this regime, reducing the dynamics to kernel gradient descent and guaranteeing exponential convergence at rates set by the spectrum of $\Theta$.
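The resulting mode-by-mode exponential decay is easy to simulate. In the sketch below, a shifted random Gram matrix stands in for the NTK on the training set (all constants are illustrative choices, not values from the cited works):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 20))
# Toy stand-in for the NTK Gram matrix, shifted to be well-conditioned
Theta = A @ A.T / 20 + 0.1 * np.eye(20)
y = rng.normal(size=20)      # training targets
f0 = np.zeros(20)            # function values at initialization

eigvals, V = np.linalg.eigh(Theta)
eta, t = 0.5, 50.0
# Closed form for df_t/dt = -eta * Theta (f_t - y):
# each eigenmode of the residual decays at rate eta * lambda_i
res = V @ (np.exp(-eta * eigvals * t) * (V.T @ (f0 - y)))
f_t = y + res
```

The slowest mode sets the overall convergence rate, which is why the minimal eigenvalue appears in convergence and generalization bounds.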
For finite-width or deep networks, this constancy breaks down: the NTK may drift (sometimes significantly), introducing feature learning outside the kernel regime (Seleznova et al., 2022, Guillen et al., 15 Aug 2025). As depth increases relative to width, the NTK dispersion and its rate of change are controlled by order/chaos phase transitions of the network's initialization hyperparameters (Seleznova et al., 2022). In the "lazy training" regime for MPS and certain deep networks, NTK remains approximately constant during training, with almost all parameter updates vanishing in the infinite-width/bond limit (Guo et al., 2021).
3. Spectral Structure, Alignment, and Feature Learning
The spectrum of the NTK matrix, i.e., its eigenvalues $\lambda_i$ and eigenvectors $v_i$, determines convergence rates, generalization, and function-class bias (Mysore et al., 9 Dec 2025, Shan et al., 2021, Khalafi et al., 2023). Modes with larger $\lambda_i$ converge more rapidly under kernel regression dynamics, and generalization error bounds scale inversely with the minimal eigenvalue $\lambda_{\min}$ (Mysore et al., 9 Dec 2025). The alignment of the NTK eigenspace with the target labels, measured by the Kernel Target Alignment (KTA)
$\mathrm{KTA}(K, y) = \langle K, y y^\top \rangle_F / (\|K\|_F \, \|y y^\top\|_F)$
and its eigenvector-resolved form, quantifies how the kernel supports learning specific directions in label space (Jiang et al., 17 Jul 2025, Shan et al., 2021, Khalafi et al., 2023).
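A direct reading of the KTA in code, with toy kernels of our own choosing (a perfectly aligned rank-one kernel versus an uninformative identity):

```python
import numpy as np

def kernel_target_alignment(K, y):
    """KTA(K, y) = <K, y y^T>_F / (||K||_F * ||y y^T||_F)."""
    yyT = np.outer(y, y)
    return np.sum(K * yyT) / (np.linalg.norm(K) * np.linalg.norm(yyT))

y = np.array([1.0, 1.0, -1.0, -1.0])
kta_best = kernel_target_alignment(np.outer(y, y), y)  # rank-one, aligned
kta_id = kernel_target_alignment(np.eye(4), y)         # uninformative
```

The aligned kernel attains the maximum KTA of 1, while the identity scores 0.5 here; higher alignment means the kernel's leading eigendirections overlap the label vector.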
During standard training, especially at the "edge of stability" where the NTK's top eigenvalue hovers near the inverse step size, empirical results show that feature learning induces alignment of the NTK with the target: leading eigenvectors increasingly overlap with the labels as learning rate increases, yielding faster convergence and improved generalization (Jiang et al., 17 Jul 2025, Shan et al., 2021).
Specialization can also occur in multi-output scenarios, where the NTK decomposes into output-specific subkernels that align with their respective target functions (Shan et al., 2021).
4. Efficient Computation and Empirical Methods
Computing the NTK matrix directly is often computationally prohibitive, especially for large models. Diverse approaches address this bottleneck:
- Explicit differentiation for MLPs produces analytic closed-form layerwise expressions and achieves speedups of $100\times$ or more compared to autodiff, with reduced memory requirements (Engel et al., 2022).
- Autodiff-based methods are general, supporting arbitrary architectures in frameworks like PyTorch (e.g., torchNTK), and can extract layerwise kernel blocks (Engel et al., 2022).
- Trace estimation techniques such as Hutch++ and one-sided Hutchinson estimators efficiently approximate NTK trace, Frobenius norm, effective rank, and kernel alignment metrics via randomized projections, enabling large-scale empirical NTK analysis even for recurrent and large models (Hazelden, 13 Nov 2025).
- Random features and sketching use arc-cosine features, leverage-score sampling, and count/tensor sketching to obtain low-dimensional linear embeddings with provable spectral approximations for NTKs (and convolutional NTKs). This reduces computation from quadratic to near-linear in the number of training points in many practical settings (Han et al., 2021, Zandieh et al., 2021).
- Dimensionality reduction via Johnson-Lindenstrauss projections and further matrix factorization can dramatically reduce both memory and computation, especially when input dimension is comparable to sample size (Ailon et al., 2022).
- Layerwise decompositions expose the contribution of each layer to the total NTK and enable memory-efficient routines in deep/narrow regimes (Engel et al., 2022).
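As a sketch of the trace-estimation idea above, here is a plain Hutchinson estimator (the simpler relative of Hutch++ and the one-sided variants). It touches the kernel only through matrix-vector products; for a real NTK these would come from JVP/VJP pairs, while the dense matrix below is just a stand-in:

```python
import numpy as np

def hutchinson_trace(matvec, dim, n_probes=100, rng=None):
    """Estimate tr(K) using only matvecs: E[z^T K z] = tr(K)
    for Rademacher probe vectors z."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ matvec(z)
    return total / n_probes

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 50))
K = A @ A.T                        # stand-in for an NTK Gram matrix
est = hutchinson_trace(lambda v: K @ v, 50, n_probes=2000, rng=rng)
```

Frobenius-norm, effective-rank, and alignment estimators follow the same pattern, using products such as $z^\top K^2 z$ or $y^\top K y$.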
Empirical validation confirms theoretical results on scaling, spectrum, and accuracy of the aforementioned methods, with order-of-magnitude speedups and matching or exceeding classic NTK implementations (Han et al., 2021, Zandieh et al., 2021, Hazelden, 13 Nov 2025).
5. Extensions: Generalized Settings and Non-Standard Architectures
NTK analysis extends to several advanced and non-standard situations:
- Mean-field and regularized regimes: NTK theory has been generalized to settings with weight decay and gradient noise, relaxing the requirement of weights remaining close to initialization by working in Wasserstein space of parameter distributions (Chen et al., 2020). This generalization allows for linear convergence and generalization even with regularization and gradient noise.
- Surrogate gradient learning: For non-differentiable activations (e.g., sign, spiking neurons), the classical NTK is ill-posed, but a "surrogate-gradient NTK" (SG-NTK) provides well-defined dynamics and theoretically grounded analysis for surrogate-gradient training (Eilers et al., 2024).
- Operator learning: NTK analysis for two-layer neural operators in the context of function space regression (surrogate PDE solvers) enables derivation of minimax-optimal convergence rates and explicit sample- and width-complexity requirements (Nguyen et al., 2024).
- Graph and tensor architectures: Analysis of NTKs for GNNs yields design principles for aligning the kernel eigenspace by optimizing the graph shift operator, with cross-covariance GSOs improving both convergence and generalization (Khalafi et al., 2023). For MPS tensor-network architectures, the NTK converges to a deterministic, positive-definite structure, guaranteeing training stability and admitting analytic solutions (Guo et al., 2021).
- Physics-informed and operator architectures: NTK analysis predicts the convergence superiority of physics-informed Kolmogorov-Arnold networks (PIKANs) over PINNs, with domain decomposition and optimizer choice directly linked to NTK spectral properties (Faroughi et al., 9 Jun 2025).
6. Finite-Width Effects and Corrections
While infinite-width NTK theory provides core insights, practical networks have finite width, yielding non-Gaussian corrections, NTK drift, and emergent feature learning. Feynman diagram formalism enables systematic calculation of $1/n$ corrections to NTK statistics, including higher-order objects such as dNTK and ddNTK, clarifying the depth stability and the vanishing of diagonal corrections for scale-invariant activations (e.g., ReLU and Leaky ReLU) (Guillen et al., 15 Aug 2025). Stability conditions, such as criticality of forward susceptibilities, ensure finite-width corrections do not explode with depth. Numerical experiments quantitatively validate these corrections and their absence/presence for specific architectures and activations.
In non-ordered (chaotic, edge-of-chaos) phase regimes, both variance at initialization and change during training scale exponentially with network depth relative to width, with significant implications for the validity of NTK-style analyses and the emergence of feature learning (Seleznova et al., 2022).
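The drift is easy to measure empirically. The sketch below uses a toy one-hidden-layer ReLU network (standard $1/\sqrt{m}$ output scaling; all sizes and the learning rate are our illustrative choices), trains it briefly with full-batch gradient descent on mean squared error, and tracks the relative change of the empirical NTK, which stays small at width 1000:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n = 5, 1000, 8           # input dim, width, sample count
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = rng.normal(size=(m, d))    # hidden-layer weights
a = rng.normal(size=m)         # output weights

def empirical_ntk(W, a):
    """Gram matrix of parameter gradients for f(x) = a.relu(Wx)/sqrt(m)."""
    pre = X @ W.T
    h = np.maximum(pre, 0.0)
    mask = (pre > 0).astype(float)
    # df/da block plus df/dW block of the kernel
    return (h @ h.T + ((mask * a) @ (mask * a).T) * (X @ X.T)) / m

K0 = empirical_ntk(W, a)
lr = 0.1
for _ in range(20):            # full-batch gradient descent on MSE
    pre = X @ W.T
    h = np.maximum(pre, 0.0)
    r = (h @ a / np.sqrt(m) - y) / n
    a -= lr * (h.T @ r) / np.sqrt(m)
    W -= lr * (((pre > 0) * a) * r[:, None]).T @ X / np.sqrt(m)

K1 = empirical_ntk(W, a)
drift = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)  # small at m = 1000
```

Rerunning with a much smaller width (or much greater depth) is the standard way to see the kernel regime break down, with `drift` growing accordingly.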
7. Connections, Equivalences, and Theoretical Insights
NTK analysis illuminates both theoretical and practical machine learning frontiers:
- Equivalence of NTK and Laplace kernels: On the unit sphere $\mathbb{S}^{d-1}$, the infinite-width ReLU NTK and the Laplace kernel have provably identical reproducing kernel Hilbert spaces, with empirical and posterior matchings confirming near-complete equivalence under normalization (Lencevicius, 2022). This suggests that Laplace kernel methods can substitute for NTK analysis in high-symmetry domains.
- Role in generalization: The NTK spectrum, especially the minimal eigenvalue, controls both convergence speed and generalization error bounds (Mysore et al., 9 Dec 2025, Nguyen et al., 2024). Alignment and specialization further sharpen these guarantees by focusing kernel learning power on task-relevant directions (Shan et al., 2021).
- Guiding architecture design: NTK-Eigenvalue-Controlled Residual Networks (NTK-ECRN) demonstrate how Fourier features, scaled residual connections, and stochastic depth can be used to precisely control NTK spectral properties, yielding empirically validated improvements in optimization and generalization (Mysore et al., 9 Dec 2025).
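The NTK–Laplace connection noted above can be probed numerically. Below, both kernels are written as zonal functions of the angle on the sphere (a depth-1 ReLU NTK and a Laplace kernel with an illustrative length scale $c = 1$), and their Gram spectra on random points of $\mathbb{S}^2$ exhibit the shared fast polynomial eigenvalue decay; the equivalence is at the level of the RKHS and spectrum, not of pointwise kernel values:

```python
import numpy as np

def ntk_relu_sphere(cos_t):
    """Depth-1 infinite-width ReLU NTK on the unit sphere,
    written as a function of cos(angle) between inputs."""
    cos_t = np.clip(cos_t, -1.0, 1.0)
    t = np.arccos(cos_t)
    nngp = (np.sin(t) + (np.pi - t) * cos_t) / np.pi
    return nngp + cos_t * (np.pi - t) / np.pi

def laplace_sphere(cos_t, c=1.0):
    """Laplace kernel exp(-c * ||x - x'||) restricted to the sphere."""
    dist = np.sqrt(np.maximum(2.0 - 2.0 * np.clip(cos_t, -1.0, 1.0), 0.0))
    return np.exp(-c * dist)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # points on S^2
G = X @ X.T                                      # cosines of pairwise angles
ev_ntk = np.sort(np.linalg.eigvalsh(ntk_relu_sphere(G)))[::-1]
ev_lap = np.sort(np.linalg.eigvalsh(laplace_sphere(G)))[::-1]
# both spectra drop by orders of magnitude within the first ~50 modes
```

Because both kernels depend only on the angle, their Gram matrices share the spherical-harmonic eigenbasis in the population limit, so comparing eigenvalue decay is the natural test of RKHS equivalence.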
The NTK framework has thus become foundational for both rigorous analysis and informed engineering of deep learning models, spanning conventional architectures, operator learning, spiking networks, and structured networks in graph and tensor settings.
Key References:
- (Yang, 2020): "Tensor Programs II: Neural Tangent Kernel for Any Architecture"
- (Mysore et al., 9 Dec 2025): "Mathematical Foundations of Neural Tangents and Infinite-Width Networks"
- (Engel et al., 2022): "TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models"
- (Shan et al., 2021): "A Theory of Neural Tangent Kernel Alignment and Its Influence on Training"
- (Jiang et al., 17 Jul 2025): "Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability"
- (Seleznova et al., 2022): "Neural Tangent Kernel Beyond the Infinite-Width Limit: Effects of Depth and Initialization"
- (Guillen et al., 15 Aug 2025): "Finite-Width Neural Tangent Kernels from Feynman Diagrams"
- (Nguyen et al., 2024): "Optimal Convergence Rates for Neural Operators"
- (Eilers et al., 2024): "A generalized neural tangent kernel for surrogate gradient learning"
- (Guo et al., 2021): "Neural Tangent Kernel of Matrix Product States: Convergence and Applications"
- (Ailon et al., 2022): "Efficient NTK using Dimensionality Reduction"
- (Khalafi et al., 2023): "Neural Tangent Kernels Motivate Graph Neural Networks with Cross-Covariance Graphs"
- (Lencevicius, 2022): "An Empirical Analysis of the Laplace and Neural Tangent Kernels"
- (Zandieh et al., 2021): "Scaling Neural Tangent Kernels via Sketching and Random Features"
- (Faroughi et al., 9 Jun 2025): Physics-informed Kolmogorov-Arnold Networks and convergence analysis via NTK
These works collectively underpin modern understanding and application of NTK analysis across deep learning theory and practice.