
NTK-ECRN: Eigenvalue-Controlled Residual Networks

Updated 10 February 2026
  • NTK-ECRN is a deep residual network that explicitly controls the NTK spectrum using Fourier features, variable layer scaling, and stochastic depth.
  • The architecture stabilizes optimization by bounding eigenvalue growth, ensuring improved generalization and reliable performance across regression and classification tasks.
  • Empirical results show NTK-ECRN outperforms standard models with lower error and higher accuracy, bridging infinite-width theory and practical neural architectures.

The NTK-Eigenvalue-Controlled Residual Network (NTK-ECRN) is a deep residual architecture designed for explicit, layerwise control over the spectral properties of its Neural Tangent Kernel (NTK). By integrating Fourier feature embeddings, residual blocks with variable scaling, and stochastic depth, NTK-ECRN enables analytic and empirical study of NTK dynamics, particularly the evolution and conditioning of its eigenvalues during training. This approach both extends infinite-width neural tangent theory and yields practical, robust architectures for deep learning across regression and classification settings (Mysore et al., 9 Dec 2025).

1. Architecture and Core Design Components

NTK-ECRN is an $L$-layer residual network with finite (potentially large) hidden width $n$, instantiated as

$$h^{(l)} = h^{(l-1)} + \alpha_l \, \sigma\big(W^l h^{(l-1)} + b^l\big)$$

for $l = 1, \ldots, L$, where $\alpha_l > 0$ introduces explicit per-block scaling. The network features three principal components:

  • Fourier Feature Embeddings: The input $x \in \mathbb{R}^d$ is mapped via fixed or trainable frequencies $B \in \mathbb{R}^{d_f \times d}$ to

$$\phi(x) = [\sin(2\pi Bx), \cos(2\pi Bx)] \in \mathbb{R}^{2 d_f}$$

to amplify high-frequency modes in the input and mitigate the NTK's standard spectral bias.

  • Residual Scaling: Layerwise scales $\alpha_l$ control the magnitude of each block's update, directly modulating the NTK's spectral increments and eigenvalue growth.
  • Stochastic Depth: Each residual block is dropped with probability $p_l$, leading to

$$h^{(l)} = h^{(l-1)} + m_l \, \alpha_l \, \sigma\big(W^l h^{(l-1)} + b^l\big)$$

with $m_l \sim \mathrm{Bernoulli}(1-p_l)$ serving as a regularizer and source of NTK stability.

Parameters are initialized ("standard NTK initialization") as $W_{ij}^l \sim \mathcal{N}(0, 1/n)$ and $b_i^l \sim \mathcal{N}(0, 1/n)$, which enforces kernel convergence in the infinite-width regime. The output is computed by a final linear layer.
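A minimal NumPy sketch of the forward pass above may help fix ideas. This is an illustration, not the authors' implementation: the tanh activation, the mean readout, and the choice $n = 2 d_f$ (so residual blocks act directly on the Fourier embedding) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(x, B):
    # phi(x) = [sin(2*pi*B x), cos(2*pi*B x)] in R^{2 d_f}
    proj = 2.0 * np.pi * B @ x
    return np.concatenate([np.sin(proj), np.cos(proj)])

def init_params(n, L):
    # "Standard NTK initialization": W_ij^l, b_i^l ~ N(0, 1/n)
    Ws = [rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n)) for _ in range(L)]
    bs = [rng.normal(0.0, np.sqrt(1.0 / n), size=n) for _ in range(L)]
    return Ws, bs

def forward(x, B, Ws, bs, alphas, ps, train=False):
    # h^(l) = h^(l-1) + m_l * alpha_l * sigma(W^l h^(l-1) + b^l)
    h = fourier_features(x, B)
    for W, b, a, p in zip(Ws, bs, alphas, ps):
        # Stochastic depth: sample m_l ~ Bernoulli(1-p) at train time,
        # use the survival probability 1-p at eval time.
        m = rng.binomial(1, 1.0 - p) if train else 1.0 - p
        h = h + m * a * np.tanh(W @ h + b)
    return h.mean()  # stand-in for the final linear readout

d, d_f, L = 3, 8, 4
n = 2 * d_f
B = rng.normal(size=(d_f, d))          # fixed Fourier frequencies
Ws, bs = init_params(n, L)
alphas = [1.0 / L] * L                 # FixUp-style per-block scaling
ps = [0.1] * L                         # per-block drop probabilities
out = forward(rng.normal(size=d), B, Ws, bs, alphas, ps)
```

At evaluation time the sketch uses the deterministic expected-depth rescaling $m_l = 1 - p_l$, the usual stochastic-depth convention.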

2. NTK Definition and Kernel Spectral Evolution

The NTK at training iteration $t$ is

$$K_t(x, x') = \nabla_\theta f_\theta^{(t)}(x) \cdot \nabla_\theta f_\theta^{(t)}(x') = \sum_{l=1}^L \frac{\partial f_\theta^{(t)}(x)}{\partial W^l} \left(\frac{\partial f_\theta^{(t)}(x')}{\partial W^l}\right)^{\top}.$$

For a dataset, the Gram matrix $\Theta_t$ encodes the NTK between all pairs of training points. The growth of the NTK norm and its eigenvalues is constrained by the architecture: each step satisfies $\|\Theta_{t+1} - \Theta_0\|_F \leq \|\Theta_t - \Theta_0\|_F + \alpha_l^2 \|\sigma'\|_\infty^2$, which, iterated over all blocks and steps, yields

$$\|\Theta_t - \Theta_0\|_F \leq t \cdot \max_{l} \big(\alpha_l^2 \|\sigma'\|_\infty^2\big).$$

Eigenvalues evolve according to $\lambda_{\max}(\Theta^{(l+1)}_t) \leq \lambda_{\max}(\Theta^{(l)}_t) + \alpha_l^2 \|J_t^{(l)}\|_2^2$, with $J_t^{(l)}$ the Jacobian of block $l$. By Weyl's inequality, for all $i$,

$$|\lambda_i(A+B) - \lambda_i(A)| \leq \|B\|_2,$$

ensuring that increments in the NTK have bounded impact on all eigenmodes.
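These spectral statements can be checked numerically on a toy model. In this sketch a small hand-rolled network stands in for $f_\theta$, gradients are approximated by central finite differences rather than autodiff, and the perturbation used to test Weyl's inequality is an arbitrary small symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(theta, x):
    # Toy model f(x) = v . tanh(W x), with theta packing W (4x2) and v (4)
    W = theta[:8].reshape(4, 2)
    v = theta[8:]
    return v @ np.tanh(W @ x)

def ntk_gram(theta, xs, eps=1e-5):
    # Gram matrix: K[i, j] = grad_theta f(x_i) . grad_theta f(x_j),
    # with gradients taken by central finite differences.
    grads = []
    for x in xs:
        g = np.zeros_like(theta)
        for j in range(theta.size):
            e = np.zeros_like(theta)
            e[j] = eps
            g[j] = (f(theta + e, x) - f(theta - e, x)) / (2.0 * eps)
        grads.append(g)
    G = np.stack(grads)
    return G @ G.T

theta = rng.normal(size=12)
xs = [rng.normal(size=2) for _ in range(5)]
K = ntk_gram(theta, xs)               # symmetric and positive semidefinite

# Weyl's inequality: a symmetric perturbation Bp moves every eigenvalue
# of K by at most its spectral norm ||Bp||_2.
Bp = rng.normal(size=(5, 5))
Bp = 0.01 * (Bp + Bp.T)
lam = np.sort(np.linalg.eigvalsh(K))
lam_pert = np.sort(np.linalg.eigvalsh(K + Bp))
assert np.all(np.abs(lam_pert - lam) <= np.linalg.norm(Bp, 2) + 1e-12)
```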

3. Spectral Shaping, Generalization, and Optimization Stability

Modulation of the NTK spectrum has several key consequences:

  • Generalization: Decomposing the network output along NTK eigenvectors, each mode evolves under gradient descent as

$$f_i(t+1) = f_i(t) - \eta \, \lambda_i(t) \, \big(f_i(t) - y_i\big),$$

with a bound on generalization error $\mathcal{E}_{\mathrm{gen}} \leq \sum_{i=1}^n \frac{(f_i - y_i)^2}{\lambda_i} + \varepsilon$, where $\varepsilon$ accounts for finite-width effects. Larger eigenvalues along informative directions reduce the penalty and yield better interpolation.

  • Stability/Conditioning: Keeping the condition number $\kappa(\Theta_t) = \lambda_1(t)/\lambda_n(t)$ moderate is essential for stable optimization. Control of $\{\alpha_l\}$ and $\{p_l\}$ prevents runaway behavior ("edge of stability": rapid $\lambda_1$ spikes) and thus secures robust gradient descent.
  • Fourier and residual scaling roles: Fourier features flatten the initial eigenvalue decay (enhancing representation of high frequencies), while increasing $\alpha_l$ selectively boosts high-frequency modes at the cost of possible spectral instability if not carefully capped.
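The generalization and conditioning points can be illustrated by iterating the per-mode gradient-descent recursion directly. The eigenvalues below are hypothetical, chosen only to show that large-$\lambda_i$ modes are fit first while the stability condition $\eta \lambda_1 < 2$ holds.

```python
import numpy as np

# Per-eigenmode dynamics: f_i(t+1) = f_i(t) - eta * lam_i * (f_i(t) - y_i),
# so each residual contracts as (1 - eta * lam_i)^t.
lams = np.array([4.0, 1.0, 0.05])   # hypothetical NTK eigenvalues
y = np.ones(3)                      # targets along each eigenvector
f = np.zeros(3)                     # initial outputs
eta = 0.4                           # stable: eta * lam_max = 1.6 < 2

for _ in range(50):
    f = f - eta * lams * (f - y)

residuals = np.abs(f - y)
# The lam=4.0 mode is essentially fit; the lam=0.05 mode still lags,
# illustrating spectral bias toward large-eigenvalue directions.
kappa = lams.max() / lams.min()     # condition number lam_1 / lam_n = 80
```

Raising `eta` above $2/\lambda_1 = 0.5$ here makes the largest mode diverge, the "edge-of-stability" failure the $\{\alpha_l\}$ and $\{p_l\}$ controls are meant to avoid.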

4. Comparison with Prior Residual NTK Theory

NTK-ECRN advances over classical ResNets and fully connected (FC) architectures by offering explicit and quantitative eigenvalue control:

  • In overparameterized ResNets, the skip-connection structure was shown to constrain the operator norm of layer propagation, giving width requirements polynomial rather than exponential in depth and maintaining a strictly positive smallest eigenvalue at initialization and during training (Li et al., 2020).
  • Spectral analysis of the residual NTK (ResNTK) in the infinite-width limit demonstrates that the kernel is diagonalized by spherical harmonics, with eigenvalues decaying as $k^{-d}$ in input dimension $d$. The "spikiness" of the spectrum is controlled by the skip-to-residual weight $\alpha$; constant $\alpha$ induces spike-like sharpening as depth grows, whereas scaling $\alpha = 1/L$ ensures a depth-invariant, stable spectrum (Belfer et al., 2021).
  • At finite width, fluctuations around the infinite-width kernel (and its spectrum) are $O(1/n)$, and the condition number remains tightly controlled if $\{\alpha_l\}$ and the depth $L$ are chosen to satisfy $\frac{5m + \sum_l \alpha_l}{n} \lesssim 1$. The standard "FixUp" scaling $\alpha_l = 1/L$ achieves this flat spectrum, while intentionally larger $\alpha_l$ can be used to adjust spectral decay or condition number (Littwin et al., 2020).
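The role of the $\alpha_l = 1/L$ scaling can be seen by iterating the eigenvalue-growth recursion $\lambda_{\max}^{(l+1)} \leq \lambda_{\max}^{(l)} + \alpha_l^2 \|J\|_2^2$ from Section 2 with the Jacobian norm held at an assumed constant; the numbers are illustrative, not taken from the papers.

```python
def lam_growth(L, alpha, j_norm_sq=1.0, lam0=1.0):
    # Iterate lam^{(l+1)} = lam^{(l)} + alpha^2 * ||J||_2^2 over L blocks,
    # taking the bound with equality as a worst case.
    lam = lam0
    for _ in range(L):
        lam += alpha ** 2 * j_norm_sq
    return lam

L = 64
lam_const = lam_growth(L, alpha=1.0)      # grows linearly in depth: 1 + L
lam_fixup = lam_growth(L, alpha=1.0 / L)  # total increment 1/L: depth-invariant
```

With constant $\alpha = 1$ the worst-case top eigenvalue grows linearly in depth, while $\alpha_l = 1/L$ caps the total increment at $1/L$ regardless of depth, matching the flat-spectrum behavior described above.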

5. Empirical Performance, Metrics, and Spectrum Evolution

The performance of NTK-ECRN is validated empirically against MLPs, ResNet-18, and infinite-width predictors. Representative results include:

| Model | MSE (↓) | $R^2$ (↑) | Accuracy (%) (↑) | CE Loss (↓) |
| --- | --- | --- | --- | --- |
| MLP (512) | $0.085\pm0.007$ | $0.91\pm0.02$ | $87.3\pm1.1$ | $0.412\pm0.015$ |
| ResNet-18 | $0.072\pm0.006$ | $0.93\pm0.01$ | $89.5\pm0.9$ | $0.385\pm0.012$ |
| Standard NTK | $0.090\pm0.008$ | $0.90\pm0.02$ | $85.7\pm1.3$ | $0.425\pm0.017$ |
| NTK-ECRN | $\mathbf{0.045\pm0.004}$ | $\mathbf{0.92\pm0.01}$ | $\mathbf{93.8\pm0.7}$ | $\mathbf{0.312\pm0.010}$ |

On CIFAR-10 (5,000 images), NTK-ECRN achieves $81.9\%$ accuracy and $0.648$ cross-entropy loss, outperforming all baselines. Empirical kernel evolution exhibits smooth, predictable growth in the largest eigenvalue and linear scaling of $\|\Theta_t - \Theta_0\|_F$, in contrast to the instability and sharp spectral spikes observed in standard architectures (Mysore et al., 9 Dec 2025).

6. Significance, Extensions, and Limitations

NTK-ECRN establishes a functional bridge between infinite-width kernel theory and practical, scalable architectures. Fourier feature embedding shapes the initial spectrum; residual scaling constrains kernel drift and eigenvalue growth; and stochastic depth both regularizes and renders analytic study tractable. These architectural controls enable:

  • Adaptation of spectral properties during training, promising for high-frequency learning tasks
  • Consistent generalization and stability across depths and widths
  • Potential extensions, including adaptive per-layer scaling based on live NTK estimates or integration with standard normalization techniques

A central limitation remains the non-negligible effect of finite-width-induced fluctuations as network width decreases. Existing theoretical bounds on $\|\Theta_t - \Theta_0\|$ become less sharp in narrow settings, motivating further work on non-asymptotic kernel evolution (Mysore et al., 9 Dec 2025, Li et al., 2020, Belfer et al., 2021, Littwin et al., 2020).
