
NTK-ECRN: Eigenvalue-Controlled Residual Networks

Updated 10 February 2026
  • NTK-ECRN is a deep residual network that explicitly controls the NTK spectrum using Fourier features, variable layer scaling, and stochastic depth.
  • The architecture stabilizes optimization by bounding eigenvalue growth, ensuring improved generalization and reliable performance across regression and classification tasks.
  • Empirical results show NTK-ECRN outperforms standard models with lower error and higher accuracy, bridging infinite-width theory and practical neural architectures.

The NTK-Eigenvalue-Controlled Residual Network (NTK-ECRN) is a deep residual architecture designed for explicit, layerwise control over the spectral properties of its Neural Tangent Kernel (NTK). By integrating Fourier feature embeddings, residual blocks with variable scaling, and stochastic depth, NTK-ECRN enables analytic and empirical study of NTK dynamics, particularly the evolution and conditioning of its eigenvalues during training. This approach both extends infinite-width neural tangent theory and yields practical, robust architectures for deep learning across regression and classification settings (Mysore et al., 9 Dec 2025).

1. Architecture and Core Design Components

NTK-ECRN is an $L$-layer residual network with finite (potentially large) hidden width $n$, instantiated as

$$h^{(l)} = h^{(l-1)} + \alpha_l \, \sigma\big(W^l h^{(l-1)} + b^l\big)$$

for $l = 1, \ldots, L$, where $\alpha_l > 0$ introduces explicit per-block scaling. The network features three principal components:

  • Fourier Feature Embeddings: The input $x \in \mathbb{R}^d$ is mapped via fixed or trainable frequencies $B \in \mathbb{R}^{d_f \times d}$ to

$$\phi(x) = [\sin(2\pi Bx), \cos(2\pi Bx)] \in \mathbb{R}^{2 d_f}$$

to amplify high-frequency modes in the input and mitigate the NTK's standard spectral bias.

  • Residual Scaling: Layerwise scales $\alpha_l$ control the magnitude of each block's update, directly modulating the NTK's spectral increments and eigenvalue growth.
  • Stochastic Depth: Each residual block is dropped with probability $p_l$, leading to

$$h^{(l)} = h^{(l-1)} + m_l \, \alpha_l \, \sigma\big(W^l h^{(l-1)} + b^l\big)$$

with $m_l \sim \mathrm{Bernoulli}(1-p_l)$ serving as a regularizer and source of NTK stability.

Parameters are initialized ("standard NTK initialization") as $W_{ij}^l \sim \mathcal{N}(0, 1/n)$ and $b_i^l \sim \mathcal{N}(0, 1/n)$, which enforces kernel convergence in the infinite-width regime. The output is computed by a final linear layer.
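A minimal NumPy sketch of the forward pass above may help fix ideas. This is an illustration, not the authors' implementation: the tanh activation, the mean readout, and the choice $n = 2 d_f$ (so residual blocks act directly on the Fourier embedding) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(x, B):
    # phi(x) = [sin(2*pi*B x), cos(2*pi*B x)] in R^{2 d_f}
    proj = 2.0 * np.pi * B @ x
    return np.concatenate([np.sin(proj), np.cos(proj)])

def init_params(n, L):
    # "Standard NTK initialization": W_ij^l, b_i^l ~ N(0, 1/n)
    Ws = [rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n)) for _ in range(L)]
    bs = [rng.normal(0.0, np.sqrt(1.0 / n), size=n) for _ in range(L)]
    return Ws, bs

def forward(x, B, Ws, bs, alphas, ps, train=False):
    # h^(l) = h^(l-1) + m_l * alpha_l * sigma(W^l h^(l-1) + b^l)
    h = fourier_features(x, B)
    for W, b, a, p in zip(Ws, bs, alphas, ps):
        # Stochastic depth: sample m_l ~ Bernoulli(1-p) at train time,
        # use the survival probability 1-p at eval time.
        m = rng.binomial(1, 1.0 - p) if train else 1.0 - p
        h = h + m * a * np.tanh(W @ h + b)
    return h.mean()  # stand-in for the final linear readout

d, d_f, L = 3, 8, 4
n = 2 * d_f
B = rng.normal(size=(d_f, d))          # fixed Fourier frequencies
Ws, bs = init_params(n, L)
alphas = [1.0 / L] * L                 # FixUp-style per-block scaling
ps = [0.1] * L                         # per-block drop probabilities
out = forward(rng.normal(size=d), B, Ws, bs, alphas, ps)
```

At evaluation time the sketch uses the deterministic expected-depth rescaling $m_l = 1 - p_l$, the usual stochastic-depth convention.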

2. NTK Definition and Kernel Spectral Evolution

The NTK at training iteration $t$ is

$$K_t(x, x') = \nabla_\theta f_\theta^{(t)}(x) \cdot \nabla_\theta f_\theta^{(t)}(x') = \sum_{l=1}^L \frac{\partial f_\theta^{(t)}(x)}{\partial W^l} \left(\frac{\partial f_\theta^{(t)}(x')}{\partial W^l}\right)^{\top}.$$

For a dataset, the Gram matrix $\Theta_t$ encodes the NTK between all pairs of training points. The growth of the NTK norm and its eigenvalues is constrained by the architecture: each step satisfies $\|\Theta_{t+1} - \Theta_0\|_F \leq \|\Theta_t - \Theta_0\|_F + \alpha_l^2 \|\sigma'\|_\infty^2$, which, iterated over all blocks and steps, yields

$$\|\Theta_t - \Theta_0\|_F \leq t \cdot \max_{l} \big(\alpha_l^2 \|\sigma'\|_\infty^2\big).$$

Eigenvalues evolve according to $\lambda_{\max}(\Theta^{(l+1)}_t) \leq \lambda_{\max}(\Theta^{(l)}_t) + \alpha_l^2 \|J_t^{(l)}\|_2^2$, with $J_t^{(l)}$ the Jacobian of block $l$. By Weyl's inequality, for all $i$,

$$|\lambda_i(A+B) - \lambda_i(A)| \leq \|B\|_2,$$

ensuring that increments in the NTK have bounded impact on all eigenmodes.
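These spectral statements can be checked numerically on a toy model. In this sketch a small hand-rolled network stands in for $f_\theta$, gradients are approximated by central finite differences rather than autodiff, and the perturbation used to test Weyl's inequality is an arbitrary small symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(theta, x):
    # Toy model f(x) = v . tanh(W x), with theta packing W (4x2) and v (4)
    W = theta[:8].reshape(4, 2)
    v = theta[8:]
    return v @ np.tanh(W @ x)

def ntk_gram(theta, xs, eps=1e-5):
    # Gram matrix: K[i, j] = grad_theta f(x_i) . grad_theta f(x_j),
    # with gradients taken by central finite differences.
    grads = []
    for x in xs:
        g = np.zeros_like(theta)
        for j in range(theta.size):
            e = np.zeros_like(theta)
            e[j] = eps
            g[j] = (f(theta + e, x) - f(theta - e, x)) / (2.0 * eps)
        grads.append(g)
    G = np.stack(grads)
    return G @ G.T

theta = rng.normal(size=12)
xs = [rng.normal(size=2) for _ in range(5)]
K = ntk_gram(theta, xs)               # symmetric and positive semidefinite

# Weyl's inequality: a symmetric perturbation Bp moves every eigenvalue
# of K by at most its spectral norm ||Bp||_2.
Bp = rng.normal(size=(5, 5))
Bp = 0.01 * (Bp + Bp.T)
lam = np.sort(np.linalg.eigvalsh(K))
lam_pert = np.sort(np.linalg.eigvalsh(K + Bp))
assert np.all(np.abs(lam_pert - lam) <= np.linalg.norm(Bp, 2) + 1e-12)
```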

3. Spectral Shaping, Generalization, and Optimization Stability

Modulation of the NTK spectrum has several key consequences:

  • Generalization: Decomposing the network output along NTK eigenvectors, each mode evolves under gradient descent as

$$f_i(t+1) = f_i(t) - \eta \, \lambda_i(t) \, \big(f_i(t) - y_i\big),$$

with a bound on generalization error $\mathcal{E}_{\mathrm{gen}} \leq \sum_{i=1}^n \frac{(f_i - y_i)^2}{\lambda_i} + \varepsilon$, where $\varepsilon$ accounts for finite-width effects. Larger eigenvalues along informative directions reduce the penalty and yield better interpolation.

  • Stability/Conditioning: Keeping the condition number $\kappa(\Theta_t) = \lambda_1(t)/\lambda_n(t)$ moderate is essential for stable optimization. Control of $\{\alpha_l\}$ and $\{p_l\}$ prevents runaway behavior ("edge of stability": rapid $\lambda_1$ spikes) and thus secures robust gradient descent.
  • Fourier and residual scaling roles: Fourier features flatten the initial eigenvalue decay (enhancing representation of high frequencies), while increasing $\alpha_l$ selectively boosts high-frequency modes at the cost of possible spectral instability if not carefully capped.
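The generalization and conditioning points can be illustrated by iterating the per-mode gradient-descent recursion directly. The eigenvalues below are hypothetical, chosen only to show that large-$\lambda_i$ modes are fit first while the stability condition $\eta \lambda_1 < 2$ holds.

```python
import numpy as np

# Per-eigenmode dynamics: f_i(t+1) = f_i(t) - eta * lam_i * (f_i(t) - y_i),
# so each residual contracts as (1 - eta * lam_i)^t.
lams = np.array([4.0, 1.0, 0.05])   # hypothetical NTK eigenvalues
y = np.ones(3)                      # targets along each eigenvector
f = np.zeros(3)                     # initial outputs
eta = 0.4                           # stable: eta * lam_max = 1.6 < 2

for _ in range(50):
    f = f - eta * lams * (f - y)

residuals = np.abs(f - y)
# The lam=4.0 mode is essentially fit; the lam=0.05 mode still lags,
# illustrating spectral bias toward large-eigenvalue directions.
kappa = lams.max() / lams.min()     # condition number lam_1 / lam_n = 80
```

Raising `eta` above $2/\lambda_1 = 0.5$ here makes the largest mode diverge, the "edge-of-stability" failure the $\{\alpha_l\}$ and $\{p_l\}$ controls are meant to avoid.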

4. Comparison with Prior Residual NTK Theory

NTK-ECRN advances over classical ResNets and fully connected (FC) architectures by offering explicit and quantitative eigenvalue control:

  • In overparameterized ResNets, the skip-connection structure was shown to constrain the operator norm of layer propagation, giving width requirements polynomial rather than exponential in depth and maintaining a strictly positive smallest eigenvalue at initialization and during training (Li et al., 2020).
  • Spectral analysis of the residual NTK (ResNTK) in the infinite-width limit demonstrates that the kernel is diagonalized by spherical harmonics, with eigenvalues decaying as $k^{-d}$ in input dimension $d$. The "spikiness" of the spectrum is controlled by the skip-to-residual weight $\alpha$; constant $\alpha$ induces spike-like sharpening as depth grows, whereas scaling $\alpha = 1/L$ ensures a depth-invariant, stable spectrum (Belfer et al., 2021).
  • At finite width, fluctuations around the infinite-width kernel (and its spectrum) are $O(1/n)$, and the condition number remains tightly controlled if $\{\alpha_l\}$ and the depth $L$ are chosen to satisfy $\frac{5m + \sum_l \alpha_l}{n} \lesssim 1$. The standard "FixUp" scaling $\alpha_l = 1/L$ achieves this flat spectrum, while intentionally larger $\alpha_l$ can be used to adjust spectral decay or condition number (Littwin et al., 2020).
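The role of the $\alpha_l = 1/L$ scaling can be seen by iterating the eigenvalue-growth recursion $\lambda_{\max}^{(l+1)} \leq \lambda_{\max}^{(l)} + \alpha_l^2 \|J\|_2^2$ from Section 2 with the Jacobian norm held at an assumed constant; the numbers are illustrative, not taken from the papers.

```python
def lam_growth(L, alpha, j_norm_sq=1.0, lam0=1.0):
    # Iterate lam^{(l+1)} = lam^{(l)} + alpha^2 * ||J||_2^2 over L blocks,
    # taking the bound with equality as a worst case.
    lam = lam0
    for _ in range(L):
        lam += alpha ** 2 * j_norm_sq
    return lam

L = 64
lam_const = lam_growth(L, alpha=1.0)      # grows linearly in depth: 1 + L
lam_fixup = lam_growth(L, alpha=1.0 / L)  # total increment 1/L: depth-invariant
```

With constant $\alpha = 1$ the worst-case top eigenvalue grows linearly in depth, while $\alpha_l = 1/L$ caps the total increment at $1/L$ regardless of depth, matching the flat-spectrum behavior described above.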

5. Empirical Performance, Metrics, and Spectrum Evolution

The performance of NTK-ECRN is validated empirically against MLPs, ResNet-18, and infinite-width predictors. Representative results include:

| Model | MSE (↓) | $R^2$ (↑) | Accuracy (%) (↑) | CE Loss (↓) |
| --- | --- | --- | --- | --- |
| MLP (512) | $0.085\pm0.007$ | $0.91\pm0.02$ | $87.3\pm1.1$ | $0.412\pm0.015$ |
| ResNet-18 | $0.072\pm0.006$ | $0.93\pm0.01$ | $89.5\pm0.9$ | $0.385\pm0.012$ |
| Standard NTK | $0.090\pm0.008$ | $0.90\pm0.02$ | $85.7\pm1.3$ | $0.425\pm0.017$ |
| NTK-ECRN | $\mathbf{0.045\pm0.004}$ | $\mathbf{0.92\pm0.01}$ | $\mathbf{93.8\pm0.7}$ | $\mathbf{0.312\pm0.010}$ |

On CIFAR-10 (5,000 images), NTK-ECRN achieves $81.9\%$ accuracy and $0.648$ cross-entropy loss, outperforming all baselines. Empirical kernel evolution exhibits smooth, predictable growth in the largest eigenvalue and linear scaling of $\|\Theta_t - \Theta_0\|_F$, in contrast to the instability and sharp spectral spikes observed in standard architectures (Mysore et al., 9 Dec 2025).

6. Significance, Extensions, and Limitations

NTK-ECRN establishes a functional bridge between infinite-width kernel theory and practical, scalable architectures. Fourier feature embedding shapes the initial spectrum; residual scaling constrains kernel drift and eigenvalue growth; and stochastic depth both regularizes and renders analytic study tractable. These architectural controls enable:

  • Adaptation of spectral properties during training, promising for high-frequency learning tasks
  • Consistent generalization and stability across depths and widths
  • Potential extensions, including adaptive per-layer scaling based on live NTK estimates or integration with standard normalization techniques

A central limitation remains the non-negligible effect of finite-width-induced fluctuations as network width decreases. Existing theoretical bounds on $\|\Theta_t - \Theta_0\|$ become less sharp in narrow settings, motivating further work on non-asymptotic kernel evolution (Mysore et al., 9 Dec 2025, Li et al., 2020, Belfer et al., 2021, Littwin et al., 2020).
