Spectrum-Aware Early Stopping

Updated 29 January 2026
  • Spectrum-aware early stopping is an iterative regularization method that leverages eigenspectrum analysis to determine stopping times, balancing bias and variance.
  • It transforms gradient-based updates into spectral filters, enabling provable risk bounds and improved generalization without the need for separate validation data.
  • Applicable to linear models, neural networks, and inverse problems, it uses spectral decomposition to tailor optimization and enhance robustness against overfitting.

Spectrum-aware early stopping is an approach to regularization and generalization control in iterative training procedures that leverages spectral decomposition or domain-specific frequency characteristics to set, analyze, and optimize stopping rules. The methodology applies across gradient-based optimization for linear models, neural networks, and inverse problems, as well as frequency-domain regularization in deep learning. By exploiting eigenspectrum information, spectrum-aware stopping can yield provable risk bounds, adaptivity to data hardness, and robustness against overfitting, often without the need for validation sets.

1. Foundational Principles

Spectrum-aware early stopping is anchored by the recognition that gradient descent and similar iterative optimization procedures can be interpreted as spectral filters: each eigendirection of the relevant operator (covariance, kernel, or system matrix) is attenuated at a rate prescribed by its eigenvalue and the step-size schedule. For linear least-squares regression in a model $y = X\beta + \epsilon$, stopping discrete GD at finite time is shown to be equivalent to a form of generalized (per-eigenvalue) ridge regularization: the solution along each eigenmode $\lambda_j$ receives an effective regularization parameter $\mu_j$ determined by the spectral transfer function $\phi_j(T)$ (Sonthalia et al., 2024). In the context of neural networks, spectrum-awareness extends to the frequency domain via phase–amplitude decomposition of intermediate model representations (PADDLES) (Huang et al., 2022), and to eigenanalysis of the Neural Tangent Kernel (NTK) for analytic stopping-time prescriptions (Xavier et al., 2024).
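
This filtering view can be checked numerically on a toy least-squares problem: with a constant step size $\eta$, $k$ steps of full-batch GD shrink the error along the eigendirection with eigenvalue $\lambda_j$ by exactly $(1-\eta\lambda_j)^k$. A minimal sketch with synthetic data (noiseless, for an exact comparison; all values illustrative):

```python
import numpy as np

# Toy check of the spectral-filter view: k steps of full-batch GD on
# 0.5 * ||X beta - y||^2 shrink the error along each eigendirection of
# X^T X by the factor (1 - eta * lambda_j)^k. All values are illustrative.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_star = rng.normal(size=p)
y = X @ beta_star                       # noiseless, for an exact comparison

H = X.T @ X
lam, V = np.linalg.eigh(H)
eta = 0.9 / lam.max()                   # step size below 1/lambda_max

beta = np.zeros(p)
k = 25
for _ in range(k):
    beta -= eta * (H @ beta - X.T @ y)  # full-batch gradient step

err0 = V.T @ (np.zeros(p) - beta_star)  # initial error, eigenbasis coordinates
predicted = (1 - eta * lam) ** k * err0
observed = V.T @ (beta - beta_star)
assert np.allclose(predicted, observed)
print("per-mode attenuation factors:", (1 - eta * lam) ** k)
```

Large-eigenvalue modes are fit almost immediately, while small-eigenvalue modes remain nearly untouched at moderate $k$, which is exactly the regularizing effect early stopping exploits.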

In inverse problems, conjugate gradient iterates can be viewed as adaptive polynomial spectral filters acting on the system matrix, with filtering strength evolving as a function of the zeros of the residual polynomial; stopping CG at a well-chosen iteration balances approximation and stochastic error in terms of the observed eigenspectrum (Hucker et al., 2024).

2. Spectral-Filtering Dynamics and Closed-Form Trajectories

The progression of parameter estimates under iterative methods such as GD or CG can be captured in closed form using spectral decomposition. For least-squares regression, write $X^\top X = V \Lambda V^\top$, with $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ and $V$ orthonormal. The GD iterate after $k$ steps with schedule $\{\eta_k\}$ is

$$\beta_k = V \Phi(k) V^\top \beta_0 + \left[I - V \Phi(k) V^\top\right](\lambda n I + X^\top X)^{\dagger} X^\top y,$$

where $\Phi(k) = \text{diag}\big(\phi(k;\lambda+\lambda_j)/\phi(0;\lambda+\lambda_j)\big)$ and $\phi(k;\zeta) = \prod_{i=1}^{k}(1-\eta_i \zeta)$ (Sonthalia et al., 2024). Each coordinate behaves independently, with spectral attenuation determined by its eigenvalue.
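
The closed-form trajectory can be verified directly in the special case $\beta_0 = 0$ with no explicit ridge term ($\lambda = 0$), where it reduces to $\beta_k = [I - V\Phi(k)V^\top](X^\top X)^{\dagger} X^\top y$. A sketch with an arbitrary step-size schedule, on synthetic data:

```python
import numpy as np

# Verify the closed-form spectral trajectory in the special case beta_0 = 0
# and no explicit ridge term (lambda = 0), where it reduces to
#   beta_k = [I - V Phi(k) V^T] (X^T X)^+ X^T y,
#   Phi(k)_jj = prod_i (1 - eta_i * lambda_j),
# for an arbitrary step-size schedule {eta_i}. Data are synthetic.
rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X.T @ X
lam, V = np.linalg.eigh(H)
etas = 0.5 / lam.max() * (1 + 0.3 * np.sin(np.arange(30)))  # varying schedule

beta = np.zeros(p)
for eta in etas:
    beta -= eta * (H @ beta - X.T @ y)

phi = np.prod(1 - np.outer(etas, lam), axis=0)   # per-eigenvalue transfer
b_ols = np.linalg.pinv(H) @ X.T @ y
beta_closed = V @ ((1 - phi) * (V.T @ b_ols))
assert np.allclose(beta, beta_closed)
```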

In CG-based inverse problems, each iterate is representable as $x_m = p_m(A^\top A) A^\top Y$, where $p_m$ is a degree-$m$ polynomial filter matched to the spectral distribution of $A^\top A$. A continuous interpolation $x(t)$ between iterates allows precise bias–variance tracking by virtue of the changing polynomial zero locations $x_{i,t}$, which determine filtering strength per direction (Hucker et al., 2024).

Neural networks under NTK dynamics undergo training-error decay governed by the NTK spectrum. For two-layer ReLU networks, a single GD step along eigenvector $u_1$ with eigenvalue $\lambda_1$ produces an explicit reduction in error norm and an analytic stopping time $T^* = (1-\gamma_1)/\lambda_1^-$, where $\gamma_1$ is a function of the initial error projection and NTK parameters (Xavier et al., 2024).

3. Spectrum-Aware Stopping Criteria and Risk Minimization

The principal goal of spectrum-aware early stopping is to minimize excess risk or generalization error, leveraging spectrum-adapted stopping rules:

Least Squares Regression:

Minimization of the expected excess risk $R(\beta_k)$ under an arbitrary schedule $\{\eta_k\}$ yields, for coordinate $j$,

$$\mathbb{E}\, R(\beta_k) = \sum_{j=1}^{p} \lambda_{\text{test},j}\, \phi_j(k)^2 \Delta_j^2 + \tau^2 \sum_{j=1}^{r} \lambda_{\text{test},j} \frac{(1-\phi_j(k))^2}{\lambda_j},$$

with optimal stopping $\phi_j(k^*) \approx \tau^2 / [\tau^2 + \lambda_j \Delta_j^2]$ and $k^* \sim (1/\eta\lambda_j)\log[1 + (\sigma^2/\tau^2)\lambda_j]$ (Sonthalia et al., 2024).
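
The approximation for $k^*$ makes the spectrum-dependence concrete: small-eigenvalue modes require far more iterations before fitting them pays off. A small illustration with made-up values of $\eta$, $\sigma^2$, $\tau^2$, and the eigenvalues:

```python
import numpy as np

# Per-mode optimal stopping times from the approximation
#   k*_j ~ (1 / (eta * lambda_j)) * log(1 + (sigma^2 / tau^2) * lambda_j).
# eta, sigma^2, tau^2, and the eigenvalues are made-up illustrative values.
eta, sigma2, tau2 = 0.1, 1.0, 0.25
lam = np.array([10.0, 1.0, 0.1, 0.01])

k_star = np.log1p((sigma2 / tau2) * lam) / (eta * lam)
for l, k in zip(lam, k_star):
    print(f"lambda_j = {l:6.2f}  ->  k*_j ~ {k:8.1f} steps")

# Weak (small-eigenvalue) modes need many more steps before fitting pays off.
assert k_star[-1] > k_star[0]
```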

Conjugate Gradients for Inverse Problems:

Define a "balanced oracle" stopping time tt^* such that approximation and stochastic error coincide. Under regularity and eigenvalue decay λiip\lambda_i \sim i^{-p}, the prediction risk achieves minimax rate at tt^* (Hucker et al., 2024). A practical rule is to halt when the residual norm drops below noise level threshold κ\kappa; this is shown to yield oracle-type bounds for prediction risk.

Neural Networks (NTK):

A closed-form spectrum-aware stopping time for one-step GD is prescribed as

$$T^* = \frac{1-\gamma_1}{\lambda_1^-},$$

with $\gamma_1$ determined from projections of the initial error onto the leading NTK eigenvector and explicit network constants. The population loss is bounded above by a function of $A_1', B_1', \lambda_1, \|v(0)\|$ (Xavier et al., 2024).
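
The general mechanism behind such formulas can be sketched under linearized (NTK) dynamics, where the error along eigenvector $u_i$ decays at rate $\lambda_i$, so each mode has its own analytic stopping time. The kernel below is a random PSD stand-in rather than a real NTK, and the paper's exact constants ($\gamma_1$, $A_1'$, $B_1'$) are not reproduced:

```python
import numpy as np

# Linearized (NTK-style) dynamics: training error along eigenvector u_i of
# the kernel decays as exp(-lambda_i * t), so solving
#   exp(-lambda_i * t) * |e_i(0)| = tol
# gives an analytic per-mode stopping time. The kernel is a random PSD
# stand-in, not a real NTK, and the paper's constants are not reproduced.
rng = np.random.default_rng(3)
n = 8
M = rng.normal(size=(n, n))
K = M @ M.T + 1e-3 * np.eye(n)             # stand-in PSD "kernel"
lam, U = np.linalg.eigh(K)

e0 = rng.normal(size=n)                    # initial training error f(0) - y
tol = 1e-2
t_stop = np.log(np.abs(U.T @ e0) / tol) / lam   # per-mode stopping times

# The slowest (smallest-eigenvalue) mode dictates the overall stopping time.
T = np.max(np.clip(t_stop, 0.0, None))
e_T = U @ (np.exp(-lam * T) * (U.T @ e0))  # error after flowing for time T
assert np.max(np.abs(e_T)) <= tol + 1e-12
```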

Frequency-Domain Deep Learning (PADDLES):

Separate stopping epochs for the amplitude and phase spectra are selected by monitoring spectral surrogate losses $\mathcal{L}_{AS}(t)$ and $\mathcal{L}_{PS}(t)$, with $T_{AS}$ set at an early minimum and $T_{PS}$ possibly later. Empirical evidence shows improved robustness to label noise from decoupling phase and amplitude fitting (Huang et al., 2022).
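
The amplitude–phase split itself is a standard DFT decomposition; a minimal sketch on a stand-in feature map (the surrogate losses and per-spectrum stopping epochs are not modeled here):

```python
import numpy as np

# PADDLES-style spectral split: 2-D DFT of a feature map, separated into
# amplitude and phase spectra. The feature map is random stand-in data; the
# surrogate losses and per-spectrum stopping epochs are not modeled here.
rng = np.random.default_rng(4)
feat = rng.normal(size=(16, 16))           # stand-in intermediate feature map

F = np.fft.fft2(feat)
amplitude, phase = np.abs(F), np.angle(F)

# Recombining amplitude and phase recovers the original map exactly.
recon = np.fft.ifft2(amplitude * np.exp(1j * phase)).real
assert np.allclose(recon, feat)

# "Phase-only" and "amplitude-only" views, monitored (and stopped) separately:
phase_only = np.fft.ifft2(np.exp(1j * phase)).real
amp_only = np.fft.ifft2(amplitude).real
print("phase-only / amplitude-only maps:", phase_only.shape, amp_only.shape)
```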

4. Algorithmic Realizations and Empirical Evidence

Spectrum-aware early stopping is instantiated algorithmically by the following representative procedures:

  • Generalized Ridge Equivalence (Least Squares): Run full-batch GD for $T$ steps and equivalently solve a per-eigenvalue penalized regression with $\mu_j$ extracted from the transfer functions $\phi_j(T)$. This framework allows for flexible learning-rate schedules and arbitrary spectral profiles (Sonthalia et al., 2024).
  • PADDLES (Deep CNNs): At a chosen intermediate layer, perform a DFT to disentangle amplitude ($\mathcal{AS}_\chi$) and phase ($\mathcal{PS}_\chi$), freeze gradients selectively, and set stopping epochs $T_A$, $T_P$ via spectral surrogate metrics or validation curves. Results on synthetic and real-world noisy-label data (e.g., CIFAR-10/100/10N/100N, Clothing-1M) consistently show SOTA or near-SOTA test-accuracy improvements, outperforming prior early-stopping and sample-selection methods (Huang et al., 2022).
  • CGNE for Inverse Problems: Monitor the spectral discrepancy (residual norm), halt at threshold matching noise expectation, and interpret each iterate as a spectrum-adaptive polynomial filter. Simulation demonstrates optimal prediction/reconstruction convergence rates, with practical rules matching oracle performance (Hucker et al., 2024).
  • NTK-Based Networks: Compute top NTK eigenpairs and initial error projections, and apply a single spectrum-scheduled GD step at $T^*$. Empirical studies (e.g., Van der Pol oscillator imitation via MPC) verify tight consistency between analytic risk bounds and empirical test-loss minima, with substantially reduced training time (Xavier et al., 2024).
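
For the first bullet, the per-eigenvalue penalty can be made explicit in the constant-step case: $\phi_j(T) = (1-\eta\lambda_j)^T$, and matching GD's per-mode shrinkage $1-\phi_j(T)$ to the ridge shrinkage $\lambda_j/(\lambda_j+\mu_j)$ gives $\mu_j = \lambda_j \phi_j(T)/(1-\phi_j(T))$. A numerical check under these assumptions:

```python
import numpy as np

# Explicit generalized-ridge equivalence in the constant-step case:
# phi_j(T) = (1 - eta * lambda_j)^T, and matching GD's per-mode shrinkage
# 1 - phi_j(T) to ridge shrinkage lambda_j / (lambda_j + mu_j) yields
# mu_j = lambda_j * phi_j(T) / (1 - phi_j(T)). Setup is synthetic.
rng = np.random.default_rng(5)
n, p, T = 150, 4, 40
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X.T @ X
lam, V = np.linalg.eigh(H)
eta = 0.5 / lam.max()

beta = np.zeros(p)
for _ in range(T):
    beta -= eta * (H @ beta - X.T @ y)    # T steps of full-batch GD

phi = (1 - eta * lam) ** T
mu = lam * phi / (1 - phi)                # per-eigenvalue ridge penalties
beta_ridge = np.linalg.solve(H + V @ np.diag(mu) @ V.T, X.T @ y)
assert np.allclose(beta, beta_ridge)
```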

5. Comparative Analysis: Spectrum-Aware vs. Traditional Early Stopping

| Aspect | Traditional (Validation Splitting) | Spectrum-Aware |
| --- | --- | --- |
| Data utilization | Hold-out validation subset | All training data |
| Selection heuristic | Monitor validation loss ('patience') | Closed-form $T^*$ or spectral rule |
| Guarantees | Empirical, unguaranteed | Provable risk bounds |
| Hyperparameter sensitivity | High (patience, validation noise) | Low (adaptive to spectrum, minimal tuning) |
| Computational cost | Repeated per-epoch validation | Upfront eigenvalue decomposition or spectral filtering |

Spectrum-aware methods avoid arbitrary 'patience' settings, hyperparameter search, and validation splits. They provide theoretical guarantees of risk reduction under explicit spectral regularization, and they adapt stopping times to problem 'hardness' as revealed by the spectrum (e.g., the top NTK eigenvalue, decay exponent, or noise level).

6. Extensions, Open Questions, and Future Directions

Potential extensions include:

  • Adaptive Stopping: Automation of stopping time selection via spectral gradient statistics, without held-out data. Running average metrics in Fourier or eigenspace enable label-free generalization control (Huang et al., 2022).
  • Multi-Layer Spectrum Decomposition: Insert DFTs or eigendecomposition at multiple model depths, enabling hierarchically scheduled stopping for semantic vs. textural representation classes (Huang et al., 2022).
  • Modalities Beyond Vision: Application to speech (1D-DFT on hidden states), text, and time-series domains. Early experiments report consistent gains (Huang et al., 2022).
  • Adversarial Robustness: Spectrum-aware stopping combined with adversarial training can further suppress high-frequency perturbations and ensure robust generalization (Huang et al., 2022).

Open questions include the derivation of optimal spectral metrics for semantic generalization, interactions with contrastive/self-supervised pretraining, and tight generalization bounds quantifying phase-dominant robustness.

A plausible implication is that spectrum-aware early stopping will become standard practice in domains where eigenspectrum or frequency-domain analysis is tractable, such as kernel methods, linear inverse problems, deep vision, and semi-supervised learning.

7. Summary of Key Results and Theoretical Guarantees

  • Early stopping in least-squares regression with arbitrary spectrum and learning rates achieves regularization equivalent to a generalized, per-eigendirection ridge, yielding precise signal/noise trade-off and risk decomposition (Sonthalia et al., 2024).
  • In the CG setting, polynomial filter evolution adaptively follows the spectrum, with stopping rules provably achieving minimax rates, and residual-based practical rules matching oracle performance (Hucker et al., 2024).
  • For neural networks, NTK-theory enables one-step spectrum-adaptive early stopping with analytic population-loss bounds, guaranteed to avoid overfitting in underparameterized settings and validated against simulation (Xavier et al., 2024).
  • Phase-amplitude decomposition in deep vision (PADDLES) leverages distinct spectral learning rates for semantic robustness under noisy labels, attaining state-of-the-art test accuracy and extensibility to other modalities (Huang et al., 2022).

Spectrum-aware early stopping offers a rigorously grounded, spectrally-adaptive approach for balancing approximation and generalization error, with widespread impact across statistical learning, deep neural architectures, and inverse problem regularization.
