Eigenfunctions of the Neural Tangent Kernel

Updated 10 February 2026
  • NTK eigenfunctions are orthonormal basis functions derived from the NTK that characterize learning dynamics and spectral bias in wide neural networks.
  • Spectral decomposition via Mercer's theorem and Nyström approximation enables precise empirical estimation of dominant modes and kernel-induced function spaces.
  • Their dynamic alignment with target functions during gradient descent illustrates the interplay between spectral bias, shortcut feature dominance, and convergence rates.

The eigenfunctions of the Neural Tangent Kernel (NTK) play a central role in characterizing the learning dynamics, inductive bias, and feature selection properties of (infinitely) wide neural networks in both theory and practice. For a given data distribution and neural network architecture, the NTK defines a positive semidefinite kernel whose spectrum—its eigenvalues and eigenfunctions—governs not only how target functions are represented, but also how quickly various modes of the target function are learned and retained during optimization.

1. Mathematical Formulation and Integral Operator Structure

The NTK for a neural network $f_\theta(x)$ with parameters $\theta$ is defined, at initialization, as the inner product of parameter-space gradients: $K(x, x') = \nabla_\theta f_{\theta_0}(x) \cdot \nabla_\theta f_{\theta_0}(x')$. Under suitable regularity conditions, $K$ induces an integral operator $T_K$ on $L^2(\mu)$: $(T_K \phi)(x) = \int K(x, x') \phi(x') \, d\mu(x')$, where $\mu$ is the input distribution. The associated eigenvalue problem seeks $\lambda \geq 0$ and $\phi \in L^2(\mu)$ such that

$$\int K(x, x') \phi(x') \, d\mu(x') = \lambda \phi(x),$$

with normalization $\int \phi(x)^2 \, d\mu(x) = 1$. By Mercer's theorem, the spectrum $\{\lambda_i, \phi_i\}$ is discrete (for compact $\mathcal{X}$), and the eigenfunctions $\{\phi_i\}$ form an orthonormal basis for $L^2(\mu)$ (Geifman et al., 2020; Bowman et al., 2022). On $N$ data points, the Gram matrix $G_{ij} = K(x_i, x_j)$ provides an empirical analogue, with eigenvectors $v_i$ approximating $\phi_i(x_k)$ on the dataset.
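This empirical analogue is straightforward to sketch. The snippet below builds the NTK Gram matrix of a randomly initialized two-layer ReLU network $f(x) = a^\top \mathrm{relu}(Wx)/\sqrt{m}$ directly from its explicit parameter gradients and eigendecomposes it. This is a toy setup, not code from the cited papers; the width and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 5, 2048, 200           # input dim, width, sample size (illustrative)

# Randomly initialized two-layer ReLU net f(x) = a . relu(W x) / sqrt(m)
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)

def empirical_ntk_gram(X):
    """Gram matrix G_ij = grad_theta f(x_i) . grad_theta f(x_j)."""
    pre = X @ W.T                    # (N, m) pre-activations
    act = np.maximum(pre, 0.0)       # relu(W x): the gradient w.r.t. a
    ind = (pre > 0).astype(float)    # relu'(W x): enters the gradient w.r.t. W
    G = act @ act.T / m                               # a-gradient contribution
    G += ((ind * a) @ (ind * a).T) / m * (X @ X.T)    # W-gradient contribution
    return G

X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # place the data on the sphere
G = empirical_ntk_gram(X)

lams, V = np.linalg.eigh(G)          # eigenpairs, ascending
lams, V = lams[::-1], V[:, ::-1]     # sort descending: v_i approximates phi_i on the data
```

Since $G = JJ^\top$ for the parameter Jacobian $J$, the matrix is positive semidefinite by construction, and its eigenvectors are orthonormal.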

2. Spectral Decomposition and Empirical Computation

The NTK admits the Mercer expansion

$$K(x, x') = \sum_{i=1}^\infty \lambda_i \, \phi_i(x) \phi_i(x').$$

In practice, one computes the leading $k$ eigenpairs of the empirical NTK Gram matrix by standard numerical linear algebra (e.g., Lanczos or randomized SVD). At a new input $x$, the associated eigenfunction is evaluated via the Nyström approximation: $\phi_i(x) \approx \lambda_i^{-1} \sum_{j=1}^N K(x, x_j) \, v_i(j)$. Empirical evaluations show that in finite-width settings the eigenfunctions and eigenvalues evolve throughout training, with the top eigenfunctions dynamically aligning to the target function and serving as a compact basis for the network output (Kopitkov et al., 2019).
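A minimal sketch of the Nyström extension, using an RBF kernel as a stand-in for the NTK (any positive semidefinite kernel, including an empirical NTK, can be substituted). Evaluating the extension back at the training points reproduces the Gram eigenvectors, since $\lambda_i^{-1} G v_i = v_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 300, 3, 10                 # illustrative sizes

def kernel(A, B):
    """RBF kernel as a PSD stand-in for the NTK; swap in any kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

X = rng.standard_normal((N, d))
G = kernel(X, X)
lams, V = np.linalg.eigh(G)
lams, V = lams[::-1][:k], V[:, ::-1][:, :k]      # top-k eigenpairs

def nystrom(x_new):
    """phi_i(x) ~ (1/lam_i) * sum_j K(x, x_j) v_i(j), for i = 1..k."""
    return kernel(x_new, X) @ V / lams           # shape (n_new, k)

# Sanity check: at the training points the extension returns the eigenvectors
phi = nystrom(X)
```

For large $N$, replacing `np.linalg.eigh` with a Lanczos-type solver such as `scipy.sparse.linalg.eigsh` avoids the full $O(N^3)$ eigendecomposition.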

3. Analytical Characterizations in Key Data and Model Regimes

For fully connected networks and rotation-invariant distributions (e.g., the uniform measure on $S^{d-1}$), the NTK is a zonal kernel: $K(x, x') = H(x \cdot x')$. The Mercer decomposition uses the spherical harmonics $Y_{\ell, m}$: $K(x, x') = \sum_{\ell=0}^{\infty} \sum_{m=1}^{N(d, \ell)} \lambda_\ell \, Y_{\ell, m}(x) Y_{\ell, m}(x')$, with eigenvalues

$$\lambda_\ell = \frac{a_\ell}{N(d, \ell)},$$

where the $a_\ell$ are the Gegenbauer expansion coefficients of $H$ and $N(d, \ell)$ is the multiplicity of degree-$\ell$ harmonics. The eigenfunctions are the spherical harmonics, and asymptotically the eigenvalues decay as $\Theta(\ell^{-d})$, directly matching the spectrum of the Laplace kernel (Geifman et al., 2020). Thus the NTK and the Laplace kernel yield RKHSs with the same Sobolev regularity on the sphere.
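The multiplicities $N(d, \ell)$ follow a standard binomial formula (the dimension of the space of degree-$\ell$ spherical harmonics on $S^{d-1}$); a small utility, included here for illustration only:

```python
from math import comb

def multiplicity(d, ell):
    """N(d, ell): dimension of the degree-ell spherical harmonics on S^{d-1}."""
    if ell == 0:
        return 1
    if ell == 1:
        return d                      # linear harmonics are the d coordinates
    return comb(d + ell - 1, ell) - comb(d + ell - 3, ell - 2)

# d = 3 (the 2-sphere) recovers the familiar 2*ell + 1
print([multiplicity(3, ell) for ell in range(5)])   # [1, 3, 5, 7, 9]
```

On the circle ($d = 2$) the formula gives multiplicity 2 for every $\ell \geq 1$, the cosine and sine modes.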

In the case of multilayer linear networks and Gaussian mixture data,

$$K(x, x') \propto x^\top x', \quad \phi(x) = x^\top v,$$

where $v$ solves $Mv = av$, with $M$ the second-moment matrix of the mixture, and the eigenvalue is $\lambda = \sum_k \pi_k \sigma_k^2 + a$ (Lim et al., 3 Feb 2026).
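This reduction can be checked numerically: for a linear kernel, the SVD of the data matrix ties the top Gram eigenvector to the linear eigenfunction $x \mapsto x^\top v$ evaluated on the data. A sketch with an illustrative two-component mixture (the means and covariances below are arbitrary choices, not from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 4

# Two-component Gaussian mixture; parameters are illustrative
means = np.array([[ 2.0, 0.0, 0.0, 0.0],
                  [-2.0, 0.0, 0.0, 0.0]])
X = means[rng.integers(0, 2, size=N)] + rng.standard_normal((N, d))

# For K(x, x') ~ x^T x', the eigenproblem reduces to the second-moment
# matrix M = E[x x^T]: eigenfunctions are phi(x) = x^T v with M v = a v
M = X.T @ X / N
_, Vm = np.linalg.eigh(M)
v_top = Vm[:, -1]                    # top moment eigenvector

G = X @ X.T                          # linear-kernel Gram matrix
_, Vg = np.linalg.eigh(G)
u_top = Vg[:, -1]                    # top Gram eigenvector

# The top Gram eigenvector is (up to sign) X v_top, normalized
cosine = abs(u_top @ (X @ v_top)) / np.linalg.norm(X @ v_top)
```

Here `cosine` is 1 up to floating-point error: if $X = U\Sigma V^\top$, then $Xv_1 = \sigma_1 u_1$, so the function values $x_i^\top v$ span exactly the top Gram eigendirection.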

4. Dynamics of Learning and NTK Eigenfunctions

During gradient descent on mean squared error, each residual component along an NTK eigenfunction decays exponentially at a rate set by its eigenvalue: $\langle r_t, \phi_i \rangle \approx e^{-\lambda_i t} \langle r_0, \phi_i \rangle$, with corrections governed by the "damped deviations" framework in underparameterized regimes (Bowman et al., 2022). In overparameterized or NTK-linear regimes, the empirical spectrum aligns rapidly to its limiting form, and directions with large $\lambda_i$ are fitted significantly faster. For finite-width networks, the NTK eigenfunctions are not perfectly static but instead rotate during training to align the top spectrum with the target function (Kopitkov et al., 2019).
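The mode-wise decay is easy to verify for the linearized (kernel) dynamics. With discrete full-batch gradient descent the exact per-mode factor is $(1 - \eta\lambda_i)^t \approx e^{-\eta\lambda_i t}$; the sketch below uses a random PSD matrix as a stand-in Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100

# A fixed PSD Gram matrix (random Wishart) as a stand-in NTK
A = rng.standard_normal((N, N))
G = A @ A.T / N
lams, V = np.linalg.eigh(G)          # ascending eigenvalues

# GD on MSE acts linearly on the residual:
#   r_{t+1} = (I - eta G) r_t  =>  <r_t, v_i> = (1 - eta lam_i)^t <r_0, v_i>
eta = 0.5 / lams.max()
r0 = rng.standard_normal(N)
coeffs0 = V.T @ r0

r, T = r0.copy(), 50
for _ in range(T):
    r = r - eta * (G @ r)

predicted = (1.0 - eta * lams) ** T * coeffs0
assert np.allclose(V.T @ r, predicted, atol=1e-8)

# Spectral bias: the top-eigenvalue mode has shrunk far more than the bottom one
decay = np.abs((V.T @ r) / coeffs0)
assert decay[-1] < decay[0]
```

After 50 steps the top mode has contracted by roughly $0.5^{50}$, while the near-null modes are essentially untouched, which is the spectral bias discussed in the next section.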

5. Spectral Bias, Shortcut Features, and Generalization

The NTK spectrum imposes a form of spectral bias: low-frequency (large-eigenvalue) modes are learned first, while high-frequency (small-eigenvalue) modes are fitted more slowly. In the presence of data with clustered structure or spurious correlation (e.g., shortcut features), the top NTK eigenfunctions can align with these dominant clusters or shortcuts. Their associated large eigenvalues ensure both rapid learning and persistent post-training influence, even after aggressive margin maximization (Lim et al., 3 Feb 2026). This provides a principled operator-theoretic explanation for shortcut feature dominance and the slowness of learning rare or nuanced modes.

Empirical studies with two-layer ReLU nets and deep architectures (e.g., ResNet-18) reveal that practical NTK eigenmodes extracted from finite-width models localize on spurious or dominant cluster features, while lower-eigenvalue modes recover more generalizable structure (Lim et al., 3 Feb 2026).

6. Role of Depth, Width, and Training Procedures

Deeper architectures achieve higher spectral alignment with the target (i.e., higher alignment $E_t(y, k)$ for a given $k$), improving convergence rates and generalization. Width, by contrast, exhibits diminishing returns for alignment in fixed-data settings (Kopitkov et al., 2019). Further, practical training procedures such as learning rate decay induce monotonic jumps in eigenvalues and redistribute energy across the spectrum, with top eigenvalues growing more rapidly and the top eigenspaces remaining stable throughout learning. This preserves the basis functions that span the majority of the network output and ensures that optimization proceeds efficiently along the dominant (high-$\lambda$) directions.

7. Interpretation and Broader Theoretical Consequences

The NTK eigenfunction perspective recasts neural network learning dynamics as a spectral filtering process in function space (Geifman et al., 2020, Bowman et al., 2022). The empirical finding that top NTK eigenfunctions form a stable, low-dimensional subspace capturing nearly all label and output variance throughout training (Kopitkov et al., 2019) clarifies why overparameterized networks generalize well on low-complexity functions, while also being susceptible to shortcut learning under distributional imbalance. The evolving NTK spectrum in finite-width networks can be interpreted as an "adaptive kernel method," in which the top spectrum is learned to match the structure of the target, providing both expressive power and implicit regularization.

Summary Table: Key Facts on NTK Eigenfunctions

| Aspect | Core Fact | Reference |
|---|---|---|
| Spectral basis | Eigenfunctions $\{\phi_i\}$ form an orthonormal basis of $L^2(\mu)$ | Geifman et al., 2020 |
| Data regime: sphere | $\phi_i$ are spherical harmonics; $\lambda_\ell = a_\ell / N(d, \ell)$ | Geifman et al., 2020 |
| Learning dynamics | Residuals along $\phi_i$ decay as $e^{-\lambda_i t}$ | Bowman et al., 2022 |
| Spectral bias | Large-$\lambda$ (smooth/low-frequency) modes are learned faster | Bowman et al., 2022 |
| Shortcut alignment | Dominant clusters/spurious features align with top $\phi_i$ and large $\lambda_i$ | Lim et al., 3 Feb 2026 |
| NTK evolution | Top eigenspace rotates to align with the target during training | Kopitkov et al., 2019 |
| Depth effect | Greater depth enables better alignment and faster convergence | Kopitkov et al., 2019 |

The NTK eigenfunction framework provides a precise spectral lens for interpreting training dynamics, feature selection, and inductive bias in both linear and deep nonlinear neural networks (Kopitkov et al., 2019; Geifman et al., 2020; Bowman et al., 2022; Lim et al., 3 Feb 2026).
