Eigenfunctions of the Neural Tangent Kernel
- NTK eigenfunctions are orthonormal basis functions derived from the NTK that characterize learning dynamics and spectral bias in wide neural networks.
- Spectral decomposition via Mercer's theorem and Nyström approximation enables precise empirical estimation of dominant modes and kernel-induced function spaces.
- Their dynamic alignment with target functions during gradient descent illustrates the interplay between spectral bias, shortcut feature dominance, and convergence rates.
The eigenfunctions of the Neural Tangent Kernel (NTK) play a central role in characterizing the learning dynamics, inductive bias, and feature selection properties of (infinitely) wide neural networks in both theory and practice. For a given data distribution and neural network architecture, the NTK defines a positive semidefinite kernel whose spectrum—its eigenvalues and eigenfunctions—governs not only how target functions are represented, but also how quickly various modes of the target function are learned and retained during optimization.
1. Mathematical Formulation and Integral Operator Structure
The NTK for a neural network $f(x;\theta)$ with parameters $\theta$ is defined, at initialization, as the inner product of parameter-space gradients:

$$\Theta(x, x') = \left\langle \nabla_\theta f(x;\theta),\, \nabla_\theta f(x';\theta) \right\rangle.$$

Under suitable regularity conditions, $\Theta$ induces an integral operator on $L^2(\mu)$:

$$(T_\Theta \varphi)(x) = \int \Theta(x, x')\,\varphi(x')\,d\mu(x'),$$

where $\mu$ is the input distribution. The associated eigenvalue problem seeks eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ and eigenfunctions $\varphi_i$ such that

$$T_\Theta \varphi_i = \lambda_i \varphi_i,$$
with normalization $\langle \varphi_i, \varphi_j \rangle_{L^2(\mu)} = \delta_{ij}$. By Mercer's theorem, the spectrum is discrete (for compact input domains), and the eigenfunctions $\{\varphi_i\}$ form an orthonormal basis for $L^2(\mu)$ (Geifman et al., 2020; Bowman et al., 2022). On $n$ data points $x_1, \dots, x_n$, the Gram matrix $K_{jk} = \frac{1}{n}\,\Theta(x_j, x_k)$ provides an empirical analogue, with eigenvectors $v_i$ approximating $\varphi_i$ on the dataset.
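The empirical analogue above can be made concrete with a small numpy sketch. The network, width, and sample size below are illustrative choices, not taken from the cited papers: we form the parameter-gradient Jacobian of a random two-layer ReLU network by hand, build the $\frac{1}{n}$-normalized Gram matrix, and eigendecompose it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 3, 256, 200            # input dim, hidden width, sample size (illustrative)

# Random two-layer ReLU network f(x) = a^T relu(W x) / sqrt(m) at initialization.
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)

def param_gradient(x):
    """Flattened gradient of f(x) with respect to all parameters (W, a)."""
    pre = W @ x                                   # pre-activations, shape (m,)
    act = np.maximum(pre, 0.0)                    # relu(W x)
    mask = (pre > 0).astype(float)                # relu derivative
    grad_W = np.outer(a * mask, x) / np.sqrt(m)   # d f / d W, shape (m, d)
    grad_a = act / np.sqrt(m)                     # d f / d a, shape (m,)
    return np.concatenate([grad_W.ravel(), grad_a])

X = rng.standard_normal((n, d))
J = np.stack([param_gradient(x) for x in X])      # (n, p) Jacobian over the sample
K = (J @ J.T) / n                                 # empirical NTK Gram matrix K_jk

# Eigenvectors approximate the eigenfunctions phi_i evaluated on the sample.
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]        # sort descending
```

By construction `K` is positive semidefinite and its eigenvectors are orthonormal, matching the discrete analogue of the Mercer decomposition.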
2. Spectral Decomposition and Empirical Computation
The NTK admits the Mercer expansion

$$\Theta(x, x') = \sum_i \lambda_i\, \varphi_i(x)\, \varphi_i(x').$$
In practice, one computes the leading eigenpairs $(\hat\lambda_i, v_i)$ of the empirical NTK Gram matrix by standard numerical linear algebra (e.g., Lanczos or randomized SVD algorithms). At a new input $x$, the associated eigenfunction is obtained via the Nyström approximation:

$$\hat\varphi_i(x) = \frac{1}{\sqrt{n}\,\hat\lambda_i} \sum_{j=1}^{n} \Theta(x, x_j)\,(v_i)_j.$$

Empirical evaluations show that in finite-width settings, the eigenfunctions and eigenvalues evolve throughout training, with the top eigenfunctions dynamically aligning to the target function and serving as a compact basis for the network output (Kopitkov et al., 2019).
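A minimal sketch of the Nyström extension, using a generic PSD kernel as a stand-in for $\Theta$ (the mechanics are identical for any kernel; the RBF choice here is purely illustrative). With the $\frac{1}{n}$-normalized Gram matrix, evaluating the extension back at the training points recovers $\sqrt{n}\,(v_i)_j$ exactly, which is a useful sanity check on the scaling.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 2
X = rng.standard_normal((n, d))

def kernel(A, B):
    """Illustrative PSD stand-in for Theta: a Gaussian RBF kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

K = kernel(X, X) / n                       # 1/n-normalized Gram matrix
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]  # descending eigenvalues

def nystrom_eigenfunction(i, x_new):
    """Extend the i-th empirical eigenfunction to new inputs via Nystrom."""
    return kernel(x_new, X) @ evecs[:, i] / (np.sqrt(n) * evals[i])

# Sanity check: at the training points the extension equals sqrt(n) * v_i.
phi0_train = nystrom_eigenfunction(0, X)
```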
3. Analytical Characterizations in Key Data and Model Regimes
For fully connected networks and rotation-invariant distributions (e.g., the uniform measure on the sphere $\mathbb{S}^{d-1}$), the NTK is a zonal kernel: $\Theta(x, x') = \kappa(\langle x, x' \rangle)$. The Mercer decomposition utilizes the spherical harmonics $Y_{k,j}$:

$$\Theta(x, x') = \sum_{k \ge 0} \lambda_k \sum_{j=1}^{N(d,k)} Y_{k,j}(x)\, Y_{k,j}(x'),$$

with eigenvalues

$$\lambda_k = \frac{c_k}{N(d,k)},$$

where $c_k$ are the Gegenbauer expansion coefficients of $\kappa$ and $N(d,k)$ is the multiplicity of degree-$k$ harmonics. The eigenfunctions are the spherical harmonics, and asymptotically the eigenvalues decay as $\lambda_k \sim k^{-d}$, directly matching the spectrum of the Laplace kernel (Geifman et al., 2020). Thus, the NTK and Laplace kernel yield RKHSs with the same Sobolev regularity on the sphere.
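The $k^{-d}$ decay can be checked numerically in the simplest case $d = 2$ (the circle), using the Laplace kernel as the spectrally equivalent stand-in discussed above. On a uniform grid the Gram matrix of a zonal kernel is circulant, so the FFT of its first row gives the eigenvalue for each frequency $k$ exactly; the grid size and fitting range below are arbitrary choices.

```python
import numpy as np

n = 4096
theta = 2 * np.pi * np.arange(n) / n
# Laplace kernel exp(-|x - x'|) between points on the unit circle
# (chordal distance |x - x'| = 2 |sin(theta/2)|).
row = np.exp(-np.abs(2 * np.sin(theta / 2)))
lam = np.fft.fft(row).real / n             # eigenvalue at frequency k (circulant)

# Log-log slope of lambda_k over a mid-frequency band.
ks = np.arange(8, 65)
slope = np.polyfit(np.log(ks), np.log(lam[ks]), 1)[0]
print(round(slope, 2))                     # close to -2, i.e. k^{-d} with d = 2
```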
In the case of multilayer linear networks and Gaussian mixture data, the NTK reduces (up to an architecture-dependent constant $c$) to a linear kernel $\Theta(x, x') = c\,\langle x, x' \rangle$, and the eigenfunctions are linear:

$$\varphi(x) = \frac{v^\top x}{\sqrt{v^\top M v}},$$

where $v$ solves $Mv = \sigma v$, with $M = \mathbb{E}[xx^\top]$ as the second-moment matrix of the mixture, and the eigenvalue is $\lambda = c\,\sigma$ (Lim et al., 3 Feb 2026).
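This correspondence is easy to verify empirically. The sketch below (mixture parameters are illustrative) builds the linear-kernel Gram matrix on samples from a two-component Gaussian mixture and checks that its top eigenvector matches the linear function $v^\top x$ built from the top eigenvector of the empirical second-moment matrix, with eigenvalue $c\,\sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, c = 1000, 5, 1.0

# Two-component Gaussian mixture with well-separated means (illustrative).
means = np.array([[3.0] + [0.0] * (d - 1), [-3.0] + [0.0] * (d - 1)])
labels = rng.integers(0, 2, size=n)
X = means[labels] + rng.standard_normal((n, d))

# Linear NTK surrogate Theta(x, x') = c <x, x'>.
K = c * (X @ X.T) / n
evals, evecs = np.linalg.eigh(K)
top_vec, top_val = evecs[:, -1], evals[-1]

# Predicted eigenfunction: phi(x) = v^T x, with M v = sigma v for the
# empirical second-moment matrix M.
M = X.T @ X / n
sig, V = np.linalg.eigh(M)
v, sigma = V[:, -1], sig[-1]
phi = X @ v                                  # eigenfunction values on the sample

cosine = abs(phi @ top_vec) / (np.linalg.norm(phi) * np.linalg.norm(top_vec))
```

Because the Gram and second-moment matrices share the singular structure of `X`, the cosine similarity is 1 up to floating-point error and `top_val` equals `c * sigma`.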
4. Dynamics of Learning and NTK Eigenfunctions
During gradient descent on mean squared error, each residual component along an NTK eigenfunction decays exponentially at a rate set by its eigenvalue:

$$\langle f_t - f^*, \varphi_i \rangle_{L^2(\mu)} \approx e^{-\lambda_i t}\, \langle f_0 - f^*, \varphi_i \rangle_{L^2(\mu)},$$

with corrections governed by the "damped deviations" framework in underparameterized regimes (Bowman et al., 2022). In overparameterized or NTK-linear regimes, the empirical spectrum aligns rapidly to its limiting form, and directions with large $\lambda_i$ are fitted significantly faster. For finite-width networks, NTK eigenfunctions are not perfectly static but instead rotate during training to align the top spectrum with the target function (Kopitkov et al., 2019).
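In discrete time with a frozen kernel, the same statement reads $r_{t+1} = (I - \eta K)\,r_t$, so the component of the residual along eigenvector $v_i$ shrinks geometrically at rate $(1 - \eta \lambda_i)$. The sketch below (random PSD matrix as a stand-in for the empirical NTK) runs linearized gradient descent and checks the per-mode decay against the exact spectral prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Any PSD matrix serves as a stand-in for the empirical NTK Gram matrix.
A = rng.standard_normal((n, n))
K = A @ A.T / n
evals, evecs = np.linalg.eigh(K)

eta = 0.5 / evals.max()                  # stable step size
r = rng.standard_normal(n)               # initial residual f_0 - y
c0 = evecs.T @ r                         # initial eigencomponents <r_0, v_i>
T = 50
for _ in range(T):
    r = r - eta * (K @ r)                # linearized GD: r <- (I - eta K) r
cT = evecs.T @ r                         # eigencomponents after T steps

# Exact prediction: each mode is scaled by (1 - eta * lambda_i)^T.
pred = (1 - eta * evals) ** T * c0
```

Large-$\lambda_i$ modes are driven to zero fastest, which is the spectral-bias statement in its simplest form.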
5. Spectral Bias, Shortcut Features, and Generalization
The NTK spectrum imposes a form of spectral bias: low-frequency (large-eigenvalue) modes are learned first, while high-frequency (small-eigenvalue) modes are fitted more slowly. In the presence of data with clustered structure or spurious correlation (e.g., shortcut features), the top NTK eigenfunctions can align with these dominant clusters or shortcuts. Their associated large eigenvalues ensure both rapid learning and persistent post-training influence, even after aggressive margin maximization (Lim et al., 3 Feb 2026). This provides a principled operator-theoretic explanation for shortcut feature dominance and the slowness of learning rare or nuanced modes.
Empirical studies with two-layer ReLU nets and deep architectures (e.g., ResNet-18) reveal that practical NTK eigenmodes extracted from finite-width models localize on spurious or dominant cluster features, while lower-eigenvalue modes recover more generalizable structure (Lim et al., 3 Feb 2026).
6. Role of Depth, Width, and Training Procedures
Deeper architectures achieve higher spectral alignment with targets (i.e., a larger fraction of the target energy $\sum_{i \le k} \langle f^*, \varphi_i \rangle^2$ captured for a given $k$), improving convergence rates and generalization. Width, by contrast, exhibits diminishing returns for alignment in fixed data settings (Kopitkov et al., 2019). Further, practical training procedures such as learning rate decay induce monotonic jumps in eigenvalues and redistribute energy across the spectrum, with top eigenvalues growing more rapidly and the top eigenspaces remaining stable throughout learning. This preserves the basis functions that span the majority of the network output and ensures that the optimization proceeds efficiently along the dominant (high-$\lambda_i$) directions.
7. Interpretation and Broader Theoretical Consequences
The NTK eigenfunction perspective recasts neural network learning dynamics as a spectral filtering process in function space (Geifman et al., 2020, Bowman et al., 2022). The empirical finding that top NTK eigenfunctions form a stable, low-dimensional subspace capturing nearly all label and output variance throughout training (Kopitkov et al., 2019) clarifies why overparameterized networks generalize well on low-complexity functions, while also being susceptible to shortcut learning under distributional imbalance. The evolving NTK spectrum in finite-width networks can be interpreted as an "adaptive kernel method," in which the top spectrum is learned to match the structure of the target, providing both expressive power and implicit regularization.
Summary Table: Key Facts on NTK Eigenfunctions
| Aspect | Core Fact | Reference |
|---|---|---|
| Spectral basis | Eigenfunctions form an orthonormal basis of $L^2(\mu)$ | (Geifman et al., 2020) |
| Data regime: sphere | $\varphi_i$ are spherical harmonics; $\lambda_k \sim k^{-d}$ | (Geifman et al., 2020) |
| Learning dynamics | Residuals along $\varphi_i$ decay as $e^{-\lambda_i t}$ | (Bowman et al., 2022) |
| Spectral bias | Large-$\lambda_i$ (smooth/low-frequency) modes learned faster | (Bowman et al., 2022) |
| Shortcut alignment | Dominant clusters/spurious features align with top $\varphi_i$ and large $\lambda_i$ | (Lim et al., 3 Feb 2026) |
| NTK evolution | Top eigenspace rotates to align with target during training | (Kopitkov et al., 2019) |
| Depth effect | Greater depth enables better alignment and faster convergence | (Kopitkov et al., 2019) |
The NTK eigenfunction framework provides a precise spectral lens for interpreting training dynamics, feature selection, and inductive bias in both linear and deep nonlinear neural networks (Kopitkov et al., 2019, Geifman et al., 2020, Bowman et al., 2022, Lim et al., 3 Feb 2026).