- The paper shows that a single large gradient step leaves the first-layer weights approximately distributed as a Gaussian with a rank-one spiked covariance, where the spike aligns learned features with the target function.
- It reveals that the induced kernel deformations selectively amplify target-aligned eigenvalues while preserving the decay structure of the initial isotropic kernel.
- Empirical results indicate that trained networks generalize on par with kernel regression using the spiked kernel in high-dimensional regimes.
Function-Space View of Feature Learning in Two-Layer Neural Networks
Introduction and Motivation
This paper addresses a fundamental theoretical gap in understanding how feature learning mechanisms in two-layer neural networks (NNs) reshape function spaces during optimization. Unlike the "lazy" regime of kernel methods or neural tangent kernel (NTK) training, modern NNs exhibit dynamic feature learning, a property ascribed to the joint adaptation of both layer weights. The analysis presented precisely quantifies how a single large gradient step in a high-dimensional regime modifies the functions that can be efficiently represented by a two-layer network, connecting these dynamics to changes in induced kernels and their spectral properties.
Theoretical Framework
The authors frame their study within a high-dimensional proportional regime, utilizing a Gaussian single-index model as the generative data process. They examine a two-layer NN with an activation σ(t) and contrast its function-space dynamics with those of random feature models (RFMs), which correspond to fixed, data-independent features and thus fixed kernel-induced Reproducing Kernel Hilbert Spaces (RKHS).
The central technical device is the analysis of a single gradient step on the first-layer weights. The updated weight distribution is shown to be well approximated by a spiked Gaussian whose spike is aligned with the target function. This in turn defines a new data-adaptive kernel, denoted $k_1$, which characterizes the reshaped function space after learning.
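The emergence of a target-aligned spike can be illustrated numerically. The following is a minimal sketch and not the paper's experimental setup: the tanh link function, the 1/p output scaling (chosen so that the network output is small at initialization), the random-sign second layer, and the sizes d, p, n are all illustrative assumptions. It computes the squared-loss gradient with respect to the first-layer weights and checks that the gradient is dominated by one singular direction pointing along the target.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 40, 100, 4000  # input dim, hidden width, sample size (illustrative)

# Gaussian single-index data: y = link(<w_star, x>); tanh link is an assumption.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = np.tanh(X @ w_star)

# Two-layer ReLU network f(x) = (1/p) * sum_j a_j * relu(<w_j, x>).
W = rng.standard_normal((p, d)) / np.sqrt(d)  # isotropic first-layer init
a = rng.choice([-1.0, 1.0], size=p)           # fixed random-sign second layer

pre = X @ W.T                                 # pre-activations, shape (n, p)
f = np.maximum(pre, 0.0) @ a / p              # network output at initialization
r = y - f                                     # residuals

# Gradient of (1/2n) * sum_i r_i^2 with respect to W, shape (p, d).
G = -((r[:, None] * (pre > 0) * a[None, :]).T @ X) / (n * p)

# A large step W - eta * G adds a near rank-one term; inspect its direction.
_, S, Vt = np.linalg.svd(G, full_matrices=False)
alignment = abs(Vt[0] @ w_star)   # |cosine| between top direction and w_star
spike_ratio = S[0] / S[1]         # how strongly one direction dominates

print(f"alignment with target: {alignment:.3f}")
print(f"top / second singular value: {spike_ratio:.1f}")
```

Under these assumptions the top right singular vector of the gradient should be close to the target direction, which is the sense in which one large step adds an approximately rank-one, target-aligned component to the weights.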
Main Results
After a single large gradient step, the distribution of learned features can be described by a rank-one spiked Gaussian covariance, introducing a target-dependent anisotropy in the parameter space. The induced post-update kernel $k_1$ can be written as
$$k_1(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0, T)}\left[\sigma(\langle w, x\rangle)\,\sigma(\langle w, x'\rangle)\right]$$
with $T$ being an explicit spiked covariance matrix incorporating the target direction. This transformation can equivalently be viewed as a data transformation: writing $w = T^{1/2} z$ with $z \sim \mathcal{N}(0, I)$ gives $\langle w, x\rangle = \langle z, T^{1/2} x\rangle$, so the "pushforward" formulation reads $k_1(x, x') = k_0(T^{1/2}x, T^{1/2}x')$, where $k_0$ is the isotropic initialization kernel. The function-space modification is therefore a distributional shift in either parameter or input space, inducing a data-adaptive RKHS.
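The equivalence between the parameter-space and input-space views can be checked numerically for ReLU. The sketch below makes two assumptions that are not taken from the paper: the spiked covariance is given the illustrative form T = I + tau * w_star w_star^T with a hypothetical spike strength tau, and the isotropic kernel is evaluated with the standard closed form for the ReLU (degree-1 arc-cosine) kernel. It compares a Monte Carlo estimate of the expectation over w ~ N(0, T) with the closed-form kernel applied to the transformed inputs.

```python
import numpy as np

def relu_kernel_isotropic(x, xp):
    """Closed form of E_{w ~ N(0, I)}[relu(<w, x>) relu(<w, x'>)]."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    theta = np.arccos(cos)
    return nx * nxp * (np.sin(theta) + (np.pi - theta) * cos) / (2.0 * np.pi)

def spiked_sqrt(w_star, tau):
    """Symmetric square root of T = I + tau * w_star w_star^T for unit w_star."""
    d = w_star.shape[0]
    return np.eye(d) + (np.sqrt(1.0 + tau) - 1.0) * np.outer(w_star, w_star)

rng = np.random.default_rng(0)
d, tau, n_mc = 8, 3.0, 400_000  # illustrative sizes; tau is hypothetical

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
T_half = spiked_sqrt(w_star, tau)

x = rng.standard_normal(d) / np.sqrt(d)
xp = rng.standard_normal(d) / np.sqrt(d)

# Input-space view: isotropic kernel evaluated on transformed inputs.
k1_pushforward = relu_kernel_isotropic(T_half @ x, T_half @ xp)

# Parameter-space view: Monte Carlo over w ~ N(0, T), sampled as w = T^{1/2} z.
w = rng.standard_normal((n_mc, d)) @ T_half  # T_half is symmetric
k1_monte_carlo = np.mean(np.maximum(w @ x, 0.0) * np.maximum(w @ xp, 0.0))

print(f"pushforward closed form: {k1_pushforward:.4f}")
print(f"Monte Carlo estimate:    {k1_monte_carlo:.4f}")
```

The two numbers should agree up to Monte Carlo error, which is all the pushforward identity claims; nothing in this check depends on how T was obtained.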
Spectral Expansion and Dominant Modes
A Taylor-like (Price's theorem-based) expansion of $k_1$ around the isotropic initialization kernel $k_0$ shows that the leading-order effect of feature learning is a linear perturbation that preferentially amplifies components aligned with the target. Higher-order terms correspond to increasingly nonlinear projections onto the target, but their contribution vanishes in the high-dimensional limit for moderate step sizes.
The paper further shows that for ReLU networks, the structure of the spectrum is analytically tractable. The spiked kernel mixes the top isotropic radial eigenfunction with target-aligned quadratic harmonics, effectively boosting directions most correlated with the informative signal.
Explicit results include:
- Selective amplification of eigenvalues: Linear functions in the direction of the target $w^*$ receive an increased eigenvalue, while those orthogonal to $w^*$ are unaffected.
- Mixing in the top eigenspace: The top eigenfunction of the operator after feature learning becomes a superposition of the constant harmonic and a degree-2 zonal harmonic aligned with $w^*$.
- No premature feature deactivation: The spectrum of the updated operator exhibits the same decay rate as the initialization, but with selective enhancement.
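The selective amplification of linear functions can be illustrated with a small Monte Carlo experiment. This is a sketch under assumptions that are not taken from the paper: inputs are standard Gaussian, the spiked covariance has the illustrative form T = I + tau * w_star w_star^T with a hypothetical spike strength tau, and the kernel quadratic form E[g(x) k(x, x') g(x')] is used as a proxy for the eigenvalue attached to a linear function g. In this simplified setting Stein's lemma gives the value u^T T u / 4 for g(x) = <u, x>, so the target direction should be boosted by a factor 1 + tau while orthogonal directions stay unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 10, 3.0          # illustrative dimension; tau is hypothetical
n_x, n_w = 4000, 1000     # Monte Carlo sample sizes for inputs and weights

w_star = np.zeros(d); w_star[0] = 1.0   # target direction
v_orth = np.zeros(d); v_orth[1] = 1.0   # a direction orthogonal to the target

# Symmetric square root of the assumed spiked covariance T.
T_half = np.eye(d) + (np.sqrt(1.0 + tau) - 1.0) * np.outer(w_star, w_star)

X = rng.standard_normal((n_x, d))       # Gaussian inputs
W0 = rng.standard_normal((n_w, d))      # weights for the isotropic kernel k0
W1 = W0 @ T_half                        # weights ~ N(0, T) for the spiked kernel k1

def quadratic_form(direction, weights):
    """Estimate E_w[(E_x[<direction, x> * relu(<w, x>)])^2]."""
    g = X @ direction                        # linear test function, shape (n_x,)
    feats = np.maximum(X @ weights.T, 0.0)   # ReLU features, shape (n_x, n_w)
    proj = g @ feats / n_x                   # projection onto each feature
    return np.mean(proj ** 2)

q0_target = quadratic_form(w_star, W0)
q1_target = quadratic_form(w_star, W1)
q0_orth = quadratic_form(v_orth, W0)
q1_orth = quadratic_form(v_orth, W1)

print(f"target direction:     k0 {q0_target:.3f} -> k1 {q1_target:.3f}")
print(f"orthogonal direction: k0 {q0_orth:.3f} -> k1 {q1_orth:.3f}")
```

The printed ratio for the target direction should be close to 1 + tau, and close to 1 for the orthogonal direction, up to Monte Carlo error.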
Numerical Validation
Empirical results match theoretical predictions: experiments with two-layer ReLU networks show increasing alignment between dominant empirical kernel eigenvectors and the target-aligned quadratic feature as the spike magnitude grows. Furthermore, the generalization performance of trained NNs tracks that of kernel regression using the spiked kernel, even though the NN has to learn the target from data, whereas the spiked kernel is given privileged access to it.
Implications and Future Directions
This work precisely quantifies how early feature learning steps in neural networks dynamically modulate the geometry of the induced function space. Rather than purely scaling an initial isotropic kernel, gradient-based learning induces structured deformations aligned with signal directions in the data, substantially enhancing representational capacity. The theoretical results provide a concrete mechanism for how neural networks overcome the limitations of fixed kernel methods, particularly in high-dimensional single-index models where kernel methods require infinite data for precise learning.
Looking forward, several implications and directions emerge:
- Algorithmic initialization: The analysis suggests that initialization schemes targeting early feature alignment (distribution shifts mimicking early gradient steps) could accelerate training.
- Beyond ReLU/two-layer: Extending the spectral and function-space dynamics to deeper architectures and other activations remains open.
- Higher-order terms and strong feature learning: In regimes where higher-order terms remain significant (e.g., very large gradient steps), the nonlinear geometry of the function space may allow richer or unexpected behaviors.
- Theoretical limits: Characterizing when and how these data-adaptive deformations enable or limit learnability for more complex generative models is a natural continuation.
Conclusion
The paper establishes a rigorous function-space perspective on early feature learning in two-layer neural networks, demonstrating that gradient updates induce spectral and geometric deformations in the induced kernel that align and amplify directions informative about the target function. These results bridge the gap between kernel methods and neural networks and inform both practical algorithm design and theoretical understanding of learning dynamics in high dimensions (2605.17718).