A Theory of Generalization in Deep Learning
Abstract: We present a non-asymptotic theory of generalization in deep learning where the empirical neural tangent kernel partitions the output space. In directions corresponding to signal, error dissipates rapidly; in the vast orthogonal dimensions corresponding to noise, the kernel's near-zero eigenvalues trap residual error in a test-invisible reservoir. Within the signal channel, minibatch SGD ensures that coherent population signal accumulates via fast linear drift, while idiosyncratic memorization is suppressed into a slow, diffusive random walk. We prove generalization survives even when the kernel evolves $\mathcal{O}(1)$ in operator norm, the full feature-learning regime. This theory naturally explains disparate phenomena in deep learning theory, such as benign overfitting, double descent, implicit bias, and grokking. Lastly, we derive an exact population-risk objective from a single training run with no validation data, for any architecture, loss, or optimizer, and prove that it measures precisely the noise in the signal channel. This objective reduces in practice to an SNR preconditioner on top of Adam, adding one state vector at no extra cost; it accelerates grokking by $5 \times$, suppresses memorization in PINNs and implicit neural representations, and improves DPO fine-tuning under noisy preferences while staying $3 \times$ closer to the reference policy.
Explain it Like I'm 14
A simple guide to “A Theory of Generalization in Deep Learning”
What is this paper about?
This paper tries to answer a big question in AI: why do very large neural networks, which can easily memorize random data, still learn patterns that work on new, unseen data? The authors propose a new, easy-to-visualize way to think about training that explains this and several puzzling effects people observe when training deep networks. They also turn their theory into a simple training trick that can make models learn useful patterns faster and avoid memorizing noise.
What questions are the authors asking?
Here are the main questions, in everyday terms:
- How can a model that’s big enough to memorize everything still learn real patterns that generalize?
- What separates “real signal” from “random noise” during training?
- Can we understand generalization when the model’s internal features change a lot while training (not just tiny changes)?
- Can we predict how changes during training will affect performance on new data?
- Can we use these ideas to design a better training rule that focuses on what helps on new data, without needing a separate validation set?
How do they approach the problem? (Explained with analogies)
The authors look at training in “output space,” meaning they focus directly on how the model’s predictions move during training, not just on its weights. They build the picture in four steps:
1) The “signal channel” and the “reservoir”
- Imagine that your model’s prediction space is a large room with many directions you could move in.
- Training builds a kind of map that says which directions are effective for reducing loss. The authors call the useful directions the signal channel. The directions that don’t matter for test data form the reservoir.
- Signal channel: directions where the model’s learning actually reduces error in a way that shows up on new data.
- Reservoir: directions where training can reduce training error but those changes are invisible on test data—like shouting into a soundproof closet. Noise that gets pushed into the reservoir can’t hurt you on test data.
A core claim they prove: errors sitting in the reservoir cannot affect any test set. That’s why some “memorization” turns out to be harmless.
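To make the "trapped error" idea concrete, here is a minimal sketch of the simplest special case: a frozen kernel and squared loss, with the training error written in the kernel's eigen-directions. This is an assumption made for illustration only (the paper's kernel evolves during training), and the function name `residual_after` and the two eigenvalues are hypothetical. The sketch shows only that error stays put in near-zero-eigenvalue directions; it does not demonstrate the test-invisibility claim.

```python
# Frozen-kernel special case (an assumption; the paper's kernel evolves).
# In the kernel's eigenbasis, gradient descent on squared loss multiplies each
# residual coordinate by (1 - lr * eigenvalue) at every step.

def residual_after(steps, lr, eigenvalues, residual):
    """Return the residual coordinates after `steps` gradient-descent steps."""
    out = list(residual)
    for _ in range(steps):
        out = [(1.0 - lr * lam) * r for lam, r in zip(eigenvalues, out)]
    return out

# Hypothetical two-direction example: one strong "signal" eigenvalue and one
# near-zero "reservoir" eigenvalue, both starting with unit error.
eigenvalues = [1.0, 1e-6]
final = residual_after(steps=1000, lr=0.1, eigenvalues=eigenvalues, residual=[1.0, 1.0])
print(final)  # first entry is essentially zero, second is still close to 1
```

The error in the strong direction is wiped out almost immediately, while the error in the near-zero direction barely moves even after a thousand steps.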
2) Drift vs. diffusion in minibatch SGD
- When you train with minibatches, each update has:
- Drift: the consistent, averaged push in the direction of true patterns (signal).
- Diffusion: the random wobble from batch-to-batch noise.
- Over time, drift accumulates steadily, in proportion to the number of steps (like walking straight), while diffusion grows only like the square root of the number of steps (like a drunkard’s random steps). This means true patterns win over noise inside the signal channel.
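The drift-versus-diffusion picture can be checked with a toy random walk. The sketch below is not the paper's setting: it assumes a scalar update made of a constant push plus independent Gaussian noise, and all names and values (`run_walk`, `drift`, `noise_std`) are hypothetical. Under those assumptions the accumulated push grows like the number of steps, while the typical size of the accumulated noise grows like its square root.

```python
import math
import random

def run_walk(steps, drift, noise_std, rng):
    """Sum `steps` updates, each a constant drift plus zero-mean Gaussian noise.

    Returns the accumulated drift and the accumulated noise separately.
    """
    signal = 0.0
    noise = 0.0
    for _ in range(steps):
        signal += drift
        noise += rng.gauss(0.0, noise_std)
    return signal, noise

rng = random.Random(0)  # fixed seed so the sketch is reproducible
steps, drift, noise_std, trials = 400, 0.05, 1.0, 500

noise_sq = 0.0
for _ in range(trials):
    signal, noise = run_walk(steps, drift, noise_std, rng)
    noise_sq += noise * noise

rms_noise = math.sqrt(noise_sq / trials)      # empirical size of the accumulated noise
predicted_rms = noise_std * math.sqrt(steps)  # square-root growth prediction
print(signal, rms_noise, predicted_rms)
```

Each single step here is dominated by noise (a push of 0.05 against jitter of size 1), yet after 400 steps the accumulated push has caught up with the accumulated noise, and from there on it pulls ahead because the ratio grows like the square root of the step count.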
3) Train–test coupling even with changing features
- In modern training, a network’s internal features change a lot. The authors show that, under the usual squared error loss, if you know how your predictions changed during training on the training set, you can exactly determine how they changed on the test set—on the actual run—despite strong feature changes.
- In short: training motion determines test motion through a simple linear rule, on the realized path of training, not just in a special “lazy” regime.
4) A practical training rule for population risk (no validation set needed)
- Using a “leave-one-out” idea, each training example takes a turn acting as a tiny test point against the rest of the batch. This lets you estimate how much your current update would help on new data.
- This leads to a simple per-parameter rule: only update a parameter when its average “push” (signal) is stronger than its jitter (noise). It’s like an automatic “signal-to-noise” (SNR) filter.
- This slots on top of Adam with one extra state vector and almost no extra cost.
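The link between the leave-one-out idea and the signal-to-noise rule can be sketched numerically. For one parameter, averaging the product of one example's gradient with another example's gradient over all pairs of distinct examples equals μ² − σ²/(b−1), where μ is the batch-mean gradient, σ² is the divide-by-b variance, and b is the batch size. That is consistent with the threshold μ_k² > σ_k²/(b−1) quoted later on this page, but the derivation here is a reconstruction, and the function names and the toy batch are hypothetical.

```python
import random

def loo_agreement(grads):
    """Per-parameter leave-one-out agreement for a batch of per-example gradients.

    grads is a list of b gradient vectors. For each parameter k, average
    g_i[k] * g_j[k] over all ordered pairs i != j.
    """
    b, d = len(grads), len(grads[0])
    out = []
    for k in range(d):
        total = sum(grads[i][k] * grads[j][k]
                    for i in range(b) for j in range(b) if i != j)
        out.append(total / (b * (b - 1)))
    return out

def snr_margin(grads):
    """Per-parameter mu_k^2 - sigma_k^2 / (b - 1), with sigma_k^2 the
    divide-by-b variance of the per-example gradients."""
    b, d = len(grads), len(grads[0])
    out = []
    for k in range(d):
        col = [g[k] for g in grads]
        mu = sum(col) / b
        var = sum((x - mu) ** 2 for x in col) / b
        out.append(mu * mu - var / (b - 1))
    return out

rng = random.Random(0)
# Hypothetical batch: parameter 0 has a consistent push, parameter 1 is pure jitter.
grads = [[1.0 + 0.1 * rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(8)]

agreement = loo_agreement(grads)
margin = snr_margin(grads)
gate = [m > 0.0 for m in margin]  # update parameter k only where the margin is positive
print(agreement, margin, gate)
```

The two quantities agree to rounding error, so in this sketch "signal stronger than jitter" and "examples in the batch help each other on average" are the same test.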
What did they find, and why does it matter?
Main takeaways
- Generalization with feature learning: They prove generalization can hold even when the network’s “kernel” (a map of how weight changes affect outputs) changes a lot—this is the realistic, full feature-learning setting.
- Reservoir protects you: If training pushes noise into the reservoir, test performance doesn’t get worse, because the test side can’t “see” it. This explains “benign overfitting,” where a model fits training data perfectly but still does fine on test data.
- Drift beats noise: Within the useful signal channel, averaged signal builds quickly while noise grows slowly, so models lock onto real patterns rather than random quirks.
- Train–test coupling: With squared loss, training movement fully determines test movement on the actual run. This is a strong, practical link between what you see on training and what happens on test.
- One picture explains many mysteries:
- Benign overfitting: harmless fitting sits in the reservoir.
- Double descent: test error can rise and fall as capacity changes because noise can shift between channel and reservoir.
- Implicit bias: training naturally fills the signal channel from the strongest, simplest directions first.
- Grokking: the model first memorizes in the reservoir and later moves the learned structure into the signal channel—suddenly boosting test performance.
A simple training upgrade that works in practice
They turn their theory into a small change to Adam-like optimizers: a per-parameter SNR-style gate that updates a parameter only when its mean gradient is larger than its variability. This:
- Speeds up “grokking” on a math-like task by about 5× (learns the rule much faster).
- Reduces memorization in physics-informed neural networks (when initial data is noisy).
- Improves preference fine-tuning (DPO) when feedback labels are noisy, while staying about 3× closer to the original model’s behavior.
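A streaming version of such a gate might look like the sketch below. It is not the authors' implementation: this page does not say how the variance is estimated, so the sketch assumes an exponential moving average `s` of squared deviations of the minibatch gradient from its running mean, and treats `s` as a stand-in for σ_k²/b (which is only reasonable when the mean gradient changes slowly). The class name `GatedAdamSketch` and all hyperparameter values are hypothetical.

```python
import math
import random

class GatedAdamSketch:
    """Adam with a per-parameter SNR gate (illustrative sketch, not the paper's code).

    Assumption: `s` is an EMA of squared deviations of the minibatch gradient
    from its running mean `m`; it stands in for sigma_k^2 / b, so the threshold
    sigma_k^2 / (b - 1) is approximated by s * b / (b - 1).
    """

    def __init__(self, params, batch_size, lr=1e-2, beta1=0.9, beta2=0.99, eps=1e-8):
        self.params = list(params)
        self.b, self.lr, self.beta1, self.beta2, self.eps = batch_size, lr, beta1, beta2, eps
        n = len(self.params)
        self.m = [0.0] * n  # Adam first moment (running mean gradient)
        self.v = [0.0] * n  # Adam second moment
        self.s = [0.0] * n  # the one extra state vector: running squared deviation
        self.t = 0

    def step(self, grad):
        """Apply one update; return the list of gate decisions (True = updated)."""
        self.t += 1
        c1 = 1.0 - self.beta1 ** self.t  # standard Adam bias corrections
        c2 = 1.0 - self.beta2 ** self.t
        gates = []
        for k, g in enumerate(grad):
            dev = g - self.m[k]  # deviation from the previous running mean
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * g
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * g * g
            self.s[k] = self.beta2 * self.s[k] + (1 - self.beta2) * dev * dev
            m_hat, v_hat, s_hat = self.m[k] / c1, self.v[k] / c2, self.s[k] / c2
            open_gate = m_hat ** 2 > s_hat * self.b / (self.b - 1)
            if open_gate:  # skip the update entirely when noise dominates
                self.params[k] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
            gates.append(open_gate)
        return gates

rng = random.Random(0)
opt = GatedAdamSketch(params=[0.0, 0.0], batch_size=8)
history = []
for _ in range(300):
    # Hypothetical minibatch gradients: coordinate 0 is mostly signal, coordinate 1 is pure noise.
    grad = [1.0 + 0.1 * rng.gauss(0, 1), rng.gauss(0, 1)]
    history.append(opt.step(grad))

recent = history[-100:]
signal_open = sum(g[0] for g in recent) / len(recent)
noise_open = sum(g[1] for g in recent) / len(recent)
print(signal_open, noise_open, opt.params)
```

In this toy run the gate stays open for the coordinate with a consistent push and stays almost always closed for the pure-noise coordinate, so only the first parameter moves appreciably. A hard on/off gate is one design choice among several; a soft weighting by the same ratio would be another.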
Why is this important?
- It gives a clear, intuitive reason why big networks can generalize: training separates signal from noise by directing noise into a harmless reservoir and letting true patterns accumulate faster.
- It works in the realistic setting where model features evolve strongly, not just in special cases.
- It unifies several puzzling training effects under one simple picture.
- It offers a practical, low-cost optimizer tweak that aims directly at improving performance on new data—without needing a validation set to steer training.
Bottom line
Think of training as navigating a big room of possible prediction changes. The model naturally finds a “signal channel” where useful changes help on new data and a “reservoir” where noise gets trapped and can’t hurt test performance. Minibatch training strengthens real patterns and weakens noise. Even when the model’s internal features change a lot, the way training predictions move tells you exactly how test predictions move. And you can bake all of this into a tiny optimizer change that focuses updates where signal beats noise—learning faster and more reliably.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of concrete gaps and open questions that the paper leaves unresolved, organized by theme to guide future work.
Theoretical scope and assumptions
- Extension beyond squared loss: Theorem 4 (train–test coupling) proves exact coupling only under quadratic loss with constant Hessian. How can one obtain analogous guarantees (or tight bounds) for non-quadratic losses with state-dependent Hessians (e.g., cross-entropy, contrastive objectives), where the Hessian varies along the path?
- Nonsmooth architectures: All main derivations assume smooth parameter–output maps. Many practical networks (e.g., those using ReLU or other piecewise-linear activations) are nonsmooth in parameters. Can the results be extended to piecewise-smooth networks (e.g., via generalized derivatives or almost-everywhere arguments)?
- Lipschitz requirements along the trajectory: The drift–diffusion decomposition relies on a Lipschitz bound for the Jacobian to control the second-order remainder. What concrete, verifiable conditions on architectures/initializations (e.g., spectral normalization, weight decay schedules) ensure these Lipschitz constants hold along realistic training trajectories?
- Stability hypothesis for variance decay: The drift bound on noise channels requires a “replace-two stability hypothesis” on projected gradients. This is nontrivial for deep, nonconvex models; when and how does this hold in practice, and can it be replaced with verifiable proxy conditions?
- Exact characterization of constants: Several results are scale-free, with constants left implicit. What are the hidden constants and how do they depend on depth, width, norms, and data complexity? Without explicit constants, prescriptive guidance on hyperparameters remains limited.
Signal–reservoir decomposition and operator definitions
- Precise definition of test transfer operator: The paper asserts a shared factorization through for both train and test, but the displayed definition of uses , apparently omitting . A rigorous, unambiguous operator definition (e.g., ) is necessary to prevent ambiguity in proofs (e.g., reservoir invisibility) and for practical computation.
- Path dependence of the reservoir: The reservoir depends on the window and the realized trajectory. How sensitive are generalization predictions to the choice of window and training path (e.g., different optimizers, curricula, or augmentations)? Can one define a path-independent or terminal-model notion that approximates this path-dependent reservoir?
- Conditions for nontrivial reservoir: When (in terms of data geometry and network capacity) does the reservoir have substantial dimension? Under what conditions does residual training error concentrate in the reservoir versus the signal channel, and how does this relate to observed near-zero training error in modern practice?
- Distribution shift: Reservoir test-invisibility is proved in-distribution. What happens under covariate or concept shift (distribution drift, domain adaptation)? When do “reservoir” directions for the training distribution become test-visible under a shifted distribution?
Train–test coupling and bias–variance decomposition
- Beyond exact coupling: For non-quadratic losses or a non-constant Hessian, what are tight, computable upper bounds on the coupling remainder, and how do these depend on curvature, step size, and kernel drift? Can one obtain high-probability generalization bounds using the operator decomposition?
- Quantitative predictions for phenomena: The framework qualitatively explains double descent, implicit bias, and grokking. Can it yield quantitative predictors (e.g., onset time for grokking, peak location/height in double descent) from spectra of and ?
- Noise aligned with the signal channel: The only surviving variance term is the signal-channel component of label noise. When label noise aligns with population signal (structured, adversarial, or class-conditional noise), does the drift–diffusion separation still suppress memorization, and what are the failure thresholds?
Optimization and preconditioning
- Momentum and other optimizer states: The analysis accommodates PSD preconditioners M, but momentum/Adam involve stateful updates beyond instantaneous preconditioning. How do coupling and reservoir results extend to dynamics with momentum, adaptive learning rates, and decoupled weight decay?
- Optimal non-diagonal preconditioners: The practical algorithm restricts to diagonal M (per-parameter SNR gate), whereas the theoretical objective involves tr(M A_B), which could be larger for non-diagonal M. What are tractable blockwise or layerwise preconditioners that better approximate the optimal M?
- Invariance and parameterization: The gate is not invariant to reparameterizations (e.g., rescaling in weight–activation systems). How can the method be made invariant (e.g., via path-norm, Fisher, or natural-gradient metrics) without prohibitive cost?
Population-risk objective and implementation
- Small-batch regimes and estimator reliability: The gate threshold depends on the minibatch variance σ_k², which is noisy at small batch size b. What are robust, unbiased estimators and confidence-adjusted thresholds that remain stable for microbatching or distributed data-parallel training?
- Non-i.i.d. sampling and multi-epoch training: The exchangeability lemma underpins the population-risk rate. How do violations (sampling without replacement, curriculum, replay buffers, strong augmentations) bias the objective, and what are principled multi-epoch corrections?
- Distributed and mixed-precision training: Maintaining an extra variance state per parameter introduces memory/communication overhead and numerical concerns in mixed precision. What are efficient, numerically stable estimators in large-scale distributed settings?
- Interaction with normalization layers: BatchNorm/LayerNorm alter gradient statistics and inter-parameter correlations. How does the per-parameter gate behave under such layers, and would unitwise or blockwise gates perform better?
Empirical validation and scope
- Scale and domain coverage: Experiments focus on PINNs, a small grokking setup, and one DPO fine-tuning task (3 seeds). Validation on large-scale vision/NLP pretraining, diverse architectures (CNNs, ViTs, large Transformers), and varied losses is needed to assess robustness and effect sizes.
- Robustness to different noise types: Results highlight label/preference noise. How does the gate perform with input noise, augmentation stochasticity, class-conditional/asymmetric label noise, or instance-dependent corruption?
- Trade-offs in rare-signal regimes: The SNR gate may suppress updates with low-mean, high-variance gradients, potentially discarding rare but useful signals (long-tail or few-shot features). What mechanisms mitigate this (e.g., annealed thresholds, adaptive priors)?
- Convergence and optimization dynamics: There is no convergence analysis of the proposed gated optimizer in nonconvex settings. Under what conditions does the gate preserve or improve convergence rates relative to baseline AdamW?
Connections and extensions
- Relation to margin-based implicit bias: How does the operator-based view connect quantitatively with max-margin results for separable data and other implicit bias theories? Can spectra predict margin growth?
- Data augmentation and invariances: Augmentations change effective training kernels and gradients. How do and the SNR gate interact with learned invariances, and can the framework guide augmentation policy design?
- Out-of-distribution generalization and robustness: Can the signal–reservoir decomposition be extended to certify or diagnose OOD behavior (e.g., characterize directions that are test-invisible in-distribution but test-visible OOD)?
- Practical computation of operators: While training bypasses explicit and , any diagnostic use (e.g., measuring signal-channel content) requires tractable approximations. What scalable estimators (e.g., stochastic trace, randomized projections) can monitor these operators during training?
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that follow directly from the paper’s findings and the proposed population-risk training rule (an SNR-style gate layered on top of Adam/AdamW with one extra state vector).
- Robust optimizer plugin (PopRisk-Adam/AdamW)
- Sector: Software/ML infrastructure; cross-cutting across healthcare, education, finance, robotics, scientific computing.
- What: Add a per-parameter SNR gate that updates parameter k only if μ_k² > σ_k²/(b−1), using a streaming EMA of gradient means/variances. This suppresses low-SNR (noise-dominated) updates.
- Tools/workflows: PyTorch/JAX/TensorFlow optimizer wrapper; integrates with existing training scripts; logging of gradient SNR per layer.
- Assumptions/dependencies:
- Exchangeable or approximately i.i.d. minibatches.
- Stable estimation of per-parameter gradient variance (via per-example gradients, micro-batching, or step-to-step variance EMAs).
- Small enough step sizes for first-order approximations; Lipschitzity along the trajectory.
- Most theoretically precise under squared loss; empirically applicable to standard losses.
- Noisy-preference alignment for LLMs (DPO/RLHF)
- Sector: AI alignment/NLP.
- What: Use the SNR gate during DPO/RLHF fine-tuning to dampen updates driven by noisy or inconsistent preferences, maintaining closer adherence to the reference policy while improving reward accuracy.
- Tools/workflows: Hugging Face Transformers + custom optimizer; integrated logging of “reward drift from base policy.”
- Assumptions/dependencies:
- Preference/noise is exchangeable across batches; reward model reasonably calibrated.
- On-policy/off-policy sampling does not break the exchangeability assumption too severely.
- Physics-Informed Neural Networks (PINNs) and implicit neural representations (INRs)
- Sector: Scientific computing/engineering; graphics/vision (NeRFs).
- What: Reduce memorization of noisy boundary/initial conditions or pixel-level noise by gating low-SNR parameter updates, improving convergence and robustness.
- Tools/workflows: PINN solvers with AdamW+PopRisk; NeRF/INR training pipelines.
- Assumptions/dependencies:
- PDE loss terms may have different scales; rescaling/normalization and stable variance estimates are important.
- Noisy signals are not systematically biased (i.e., no persistent coherent drift in the noise channel).
- Faster “grokking” on small/algorithmic datasets
- Sector: Education/research; program synthesis; formal language tasks.
- What: Accelerate the transition from memorization to generalization by favoring coherent drift over diffusive noise in updates.
- Tools/workflows: Transformer-based curriculum tasks; optimizer plugin.
- Assumptions/dependencies:
- Data distribution is stationary enough for exchangeability-based estimates to hold over training windows.
- Online generalization monitor (validation-minimal training)
- Sector: MLOps/model governance; privacy-sensitive domains (healthcare, finance).
- What: Track an online, unbiased rate of population-risk decrease per batch (Ω_B). Use it for early stopping, LR schedules, or optimizer preconditioning—without a dedicated validation set.
- Tools/workflows: “Generalization dashboard” that logs Ω_B, per-parameter SNR thresholds crossed, and fraction of gated parameters per step.
- Assumptions/dependencies:
- Exchangeability/leave-one-out identity used in expectation; domain shift weak to moderate.
- Regulators or internal QA will still require external holdout evaluation for deployment.
- Data selection and curriculum via gradient agreement
- Sector: Data engineering/MLOps.
- What: Prefer batches that maximize off-diagonal gradient agreement (Ω_B), i.e., coherent signal across samples; deprioritize batches with conflicting/noisy gradients.
- Tools/workflows: Dataloader that scores candidate samples or microbatches by gradient agreement; batch construction policy.
- Assumptions/dependencies:
- Access to efficient per-example gradient proxies or low-overhead approximations.
- Data pipelines capable of dynamic sampling.
- Safer fine-tuning guardrails (policy drift control)
- Sector: AI safety/governance.
- What: Track and bound policy deviation from a reference model by coupling the SNR gate with a “reward drift” or KL to reference, throttling updates when drift exceeds a budget.
- Tools/workflows: Training hooks that enforce drift thresholds; optimizer gate as a safety interlock.
- Assumptions/dependencies:
- Reliable measurement of drift (e.g., reward or KL metrics).
- Noise gating should not suppress necessary adaptation under real distribution shift—monitor downstream metrics.
- Compute/energy savings through fewer training steps and fewer hyperparameter sweeps
- Sector: Energy/operations; environmental sustainability.
- What: Faster convergence and reduced grokking delay translate into fewer steps and fewer experimental runs; lower carbon/compute cost.
- Tools/workflows: Integrate with experiment tracking (W&B/MLflow) to quantify step reductions and energy savings.
- Assumptions/dependencies:
- Gains depend on noise prevalence and model regime (feature-learning preferred).
- Correctly tuned base optimizer (LR, weight decay) remains important.
- Robust supervised learning under label noise
- Sector: Healthcare/biomed (noisy labels), finance (noisy targets), education (crowd-sourced labels).
- What: SNR gating reduces propagation of mislabeled examples into parameters; mitigates overfitting to noise without specialized robust-loss design.
- Tools/workflows: Standard supervised pipelines with optimizer plugin.
- Assumptions/dependencies:
- Noise is not adversarially targeted to produce high-SNR gradients.
- Some minimum batch size to estimate variance reliably (or good streaming EMA).
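The gradient-agreement scoring mentioned under data selection can be sketched as follows. The exact definition of Ω_B is not reproduced on this page, so the sketch assumes it is the average dot product between per-example gradients of distinct examples in a batch. The function names `gradient_agreement` and `pick_batch` and the toy gradients are hypothetical.

```python
import random

def gradient_agreement(grads):
    """Average dot product g_i . g_j over ordered pairs i != j in one batch.

    Uses sum_{i != j} g_i . g_j = ||sum_i g_i||^2 - sum_i ||g_i||^2, so the cost
    is linear in the batch size rather than quadratic.
    """
    b, d = len(grads), len(grads[0])
    total = [sum(g[k] for g in grads) for k in range(d)]
    norm_of_sum = sum(x * x for x in total)
    sum_of_norms = sum(x * x for g in grads for x in g)
    return (norm_of_sum - sum_of_norms) / (b * (b - 1))

def pick_batch(candidates):
    """Return the index of the candidate batch with the highest score, plus all scores."""
    scores = [gradient_agreement(grads) for grads in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

rng = random.Random(0)
# Hypothetical per-example gradients: a coherent batch (shared direction plus
# small noise) and a conflicting batch (examples pulling in opposite directions).
coherent = [[1.0 + 0.1 * rng.gauss(0, 1), 0.5 + 0.1 * rng.gauss(0, 1)] for _ in range(6)]
conflicting = [[(-1.0) ** i, (-1.0) ** (i + 1)] for i in range(6)]

best, scores = pick_batch([conflicting, coherent])
print(best, scores)
```

The conflicting batch gets a negative score because its examples cancel each other, and the coherent batch gets a positive one. In a real pipeline the per-example gradients would have to come from a framework's per-sample gradient facility or a cheaper proxy, which is the main cost listed in the dependencies above.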
Long-Term Applications
The items below are feasible directions that require further research, scaling, or engineering to realize broadly.
- Block-structured or layerwise population-risk preconditioners
- Sector: Software/ML infrastructure.
- What: Move beyond diagonal gating to optimize tr(M A_B) with block-diagonal or low-rank M, capturing cross-parameter covariance for larger gains.
- Dependencies:
- Efficient estimation of cross-parameter gradient covariance at scale.
- Memory/computation overhead control (e.g., sketching, low-rank factorizations).
- Closed-loop “generalization control” in production training
- Sector: MLOps; online/continual learning.
- What: Use online Ω_B, train–test coupling proxies, and SNR gates in a feedback controller that adapts LR, batch size, and layer freezing to maintain a target generalization rate.
- Dependencies:
- Robustness under non-stationarity and mild covariate shift.
- Safety constraints for live systems; interpretable thresholds.
- Reservoir-aware architecture and optimizer design
- Sector: AutoML/architecture search.
- What: Explicitly shape the empirical NTK’s evolution K(t) to trap noise in the reservoir and prioritize signal-channel dissipation, via architectures (e.g., gating layers) or optimizer schedules.
- Dependencies:
- Practical estimators of cumulative dissipation (𝓦_S) or surrogates per layer.
- Understanding architecture–kernel dynamics beyond the lazy regime.
- Automated early stopping and hyperparameter selection without large validation sets
- Sector: MLOps; privacy-sensitive domains.
- What: Replace (or sharply reduce) held-out validation with online population-risk metrics and train–test coupling estimates for early stopping and HP search.
- Dependencies:
- Strong validation in non-i.i.d. or shifted settings.
- Governance acceptance and standardization; empirical benchmarks across domains.
- Interpretability via signal-channel/reservoir decomposition
- Sector: Responsible AI.
- What: Attribute training loss reductions to “signal vs. reservoir,” identifying memorized artifacts versus generalizable features, supporting dataset diagnostics and de-biasing.
- Dependencies:
- Practical approximations to project training signals onto estimated signal/reservoir subspaces at scale.
- Tooling for per-example/per-feature attributions along the realized trajectory.
- Data-centric active learning with gradient-agreement scoring
- Sector: Data engineering; cost-efficient labeling.
- What: Select new labels to acquire by maximizing expected Ω_B gains (coherent gradient directions), improving label efficiency.
- Dependencies:
- Fast estimation of agreement for candidate unlabeled points (surrogate models or subset scanning).
- Theoretical extensions to active-learning settings.
- Robust online learning under distribution shift
- Sector: Finance; robotics; edge devices.
- What: Extend the exchangeability-based population-risk rate to covariate shift or non-stationary environments, maintaining safe updates when distributions drift.
- Dependencies:
- Shift-aware corrections (importance weighting, domain adaptation).
- Drift detection coupled with SNR gating.
- Reservoir-based memorization detection for content safety
- Sector: AI safety; compliance.
- What: Use reservoir-invisibility concepts to flag training outputs likely not to transfer to test (potential memorization), and to detect leakage of unique training data.
- Dependencies:
- Robust proxies for “reservoir” components without full operator computation.
- Evaluation on privacy and copyright-sensitive datasets.
- Curriculum and augmentation policies that steer kernel evolution (grokking-aware training)
- Sector: Education; algorithmic reasoning tasks; program synthesis.
- What: Design curricula and augmentations that accelerate signal migration from reservoir to signal channel, reducing grokking delays.
- Dependencies:
- Empirical characterization of how data order and augmentations affect K(t) and 𝓦_S.
- Task-specific heuristics validated at scale.
- Sector-specific standardized metrics and benchmarks
- Sector: Policy/governance; sustainability.
- What: Establish Ω_B-based “generalization efficiency” and “drift from reference” metrics as standard reports for procurement and regulatory submissions; track carbon savings from faster convergence.
- Dependencies:
- Community adoption and reproducible benchmarks.
- Clear guidance on interpreting metrics under dataset shift and noise.
- Edge/on-device learning with small batches
- Sector: Mobile/IoT/robotics.
- What: Use SNR gating to avoid overfitting in tiny-batch or streaming settings, enabling safer on-device personalization.
- Dependencies:
- Reliable variance estimation at very small batch sizes (enhanced EMAs, microbatching).
- Compute/memory constraints on-device.
Notes on Key Assumptions and Dependencies
- Exchangeability/i.i.d.: Many guarantees rely on exchangeable sampling; under strong dataset shift the guarantees weaken, although the method often remains practically useful when applied with caution.
- Loss functions: Exact train–test coupling is proven under squared loss; the population-risk rate and SNR gate are derived via first-order expansions and are empirically effective under common losses (e.g., cross-entropy), but the strongest theory is for squared loss.
- Step-size and smoothness: First-order approximations assume sufficiently small steps and Lipschitz Jacobians along the trajectory.
- Variance estimation: Reliable σ_k² estimation may require per-example gradients or microbatching; practical implementations use streaming EMAs of deviations (one extra state vector).
- Computation/scale: Diagonal gating is low overhead; block-structured extensions will need careful engineering (sketching/low-rank methods).
- Domain shift: Reservoir invisibility and population-risk estimates hold most cleanly in-distribution. For OOD settings, combine with drift detection and importance weighting.
- Governance: While the approach reduces dependence on large validation sets, external evaluation remains necessary for regulatory compliance and deployment safety.
Glossary
- Algorithmic stability: A framework bounding how much a learning algorithm’s output changes when a single training point is modified; used to reason about generalization. "Algorithmic stability \citep{bousquet2002stability,hardt2016train} bounds the sensitivity to single-point perturbations"
- Benign overfitting: A regime where models interpolate the training data (even with noise) yet still generalize well due to spectral properties. "such as benign overfitting, double descent, implicit bias, and grokking."
- Cumulative dissipation Gramian: An integral operator capturing the total loss dissipation along the training trajectory; its range defines the signal channel and its kernel the reservoir. "The cumulative dissipation Gramian and its spectral projectors (derivation from output dynamics in \autoref{sec:operator_derivation}) are"
- DPO (Direct Preference Optimization): A preference-learning fine-tuning method optimizing policies against pairwise preferences without a reward model. "improves DPO fine-tuning under noisy preferences while staying $3 \times$ closer to the reference policy."
- Double descent: A phenomenon where test error decreases, peaks around interpolation, then decreases again as model capacity grows. "Double Descent (, bottom) is noise moving between channels as model capacity sweeps across interpolation"
- Drift–diffusion separation: The separation of coherent gradient drift (signal) accumulating linearly from stochastic minibatch fluctuations (diffusion) that grow only as a square root. "\begin{theorem}[Drift--diffusion separation]\label{thm:minibatch_coherence}"
- Empirical risk: The average loss on the training set that standard ERM minimizes. "Population-risk training is therefore the test-side analogue of empirical-risk descent."
- Exchangeability: The property that the joint distribution is invariant to permutation, enabling leave-one-out population risk estimates. "Exchangeability turns the same operators into population risk."
- Feature-learning regime: The regime where the network’s features (and NTK) evolve significantly during training, beyond the lazy (frozen-kernel) approximation. "the full feature-learning regime."
- Frozen-kernel limit: The approximation where the NTK is treated as constant during training, reducing dynamics to kernel regression. "The frozen-kernel limit reduces to a closed-form spectral filter"
- Generalization gap: The difference between population risk and empirical risk; here tied to self-influence via the test transfer. "recovers the expected generalization gap as an average of self-influences"
- Gradient flow: Continuous-time limit of gradient descent dynamics used to analyze training trajectories. "Under gradient flow "
- Grokking: Delayed generalization after prolonged memorization, associated with signal moving from reservoir to signal channel. "grokking."
- Implicit bias: The tendency of optimization (like gradient descent) to prefer certain solutions among many interpolants. "Implicit Bias (, top-left) is the spectral schedule of "
- Implicit neural representations: Networks that represent signals (like images or fields) via continuous functions rather than discrete grids. "suppresses memorization in PINNs and implicit neural representations"
- Influence functions: Classical tools estimating the impact of upweighting or removing a data point on the learned model. "Classical influence functions \citep{cook1982residuals} approximate the effect of removing a training point"
- Jacobian (parameter Jacobian): The matrix of partial derivatives of outputs with respect to parameters, central to defining the tangent kernel. "assemble their parameter Jacobian"
- Lazy regime: Training regime where parameters move little so the NTK remains effectively constant. "frozen-kernel theory \citep{jacot2018neural} describes the lazy regime"
- Leave-one-out: Estimating generalization by training without one sample and evaluating on it; used here via exchangeability. "The population-risk objective of \autoref{sec:population_risk} reads the leave-one-out displacement"
- Lipschitz constant: A global bound on how fast a function can change; used in stability and Taylor remainder bounds. "requires global Lipschitz constants"
- Martingale differences: Increments whose conditional mean, given the past, is zero; used to model minibatch gradient noise. "The fluctuations are martingale differences with respect to ."
- Minibatch SGD: Stochastic gradient descent using small random subsets of data each step; here analyzed via drift vs diffusion. "minibatch SGD ensures that coherent population signal accumulates via fast linear drift"
- Neural tangent kernel (NTK): The kernel $K = J J^\top$ induced by the network’s Jacobian $J$; governs linearized training dynamics. "The neural tangent kernel (NTK) \citep{jacot2018neural,du2019gradient} shows that sufficiently wide networks evolve as kernel methods"
- Operator norm: The largest singular value of a linear operator; used to quantify kernel drift magnitude. "even when the kernel evolves $\mathcal{O}(1)$ in operator norm"
- PAC-Bayes bounds: Generalization bounds using a prior and posterior over hypotheses and a KL penalty. "PAC-Bayes bounds \citep{mcallester1999pac,dziugaite2017computing} incorporate a data-dependent posterior"
- PINN (Physics-Informed Neural Network): A neural network trained with a loss that penalizes violations of the governing PDEs together with mismatch to observed data. "Population-risk training on a noisy-IC PINN."
- Population risk: Expected loss over the data distribution, as opposed to empirical (training) loss. "we derive a practical method that trains directly on population risk."
- Population-risk descent: An optimization rule aiming to decrease population risk directly at each step. "\begin{corollary}[Population-Risk Descent]\label{cor:pop_risk_descent}"
- Population-risk gate: A per-parameter criterion that allows updates only when estimated signal-to-noise favors generalization. "and the population-risk gate of \autoref{sec:pop_risk_training}."
- Positive semidefinite (PSD): A symmetric matrix or operator whose quadratic form is nonnegative for every vector; crucial for kernels and preconditioners. "any positive-semidefinite preconditioning of the parameter updates"
- Preconditioner: A transformation applied to gradients to change the metric of updates, improving optimization or generalization. "This objective reduces in practice to an SNR preconditioner on top of Adam"
- Propagator: The linear operator evolving the output gradient over time along the trajectory. "the propagator solves the linear ODE"
- Rademacher complexity: A data-dependent capacity measure that bounds generalization by how well a hypothesis class can correlate with random signs. "Rademacher complexity \citep{bartlett2002rademacher}"
- Reservoir (in this paper): The kernel (null space) of the cumulative dissipation; directions where training dissipated no loss and which are invisible at test. "the reservoir is , the directions where training dissipated none."
- Replace-two stability: A stability notion assessing sensitivity to replacing two samples, used here to bound noise drift. "under a replace-two stability hypothesis on the projected gradient"
- Ridge regression: L2-regularized least squares; recovered in the frozen-kernel limit as a special spectral filter. "ridge regression as different choices of one preconditioner"
- Self-influence: The contribution of a training point to the generalization gap via its leave-one-out effect. "recovers the expected generalization gap as an average of self-influences"
- Self-influence metric: An output-space metric prioritizing directions that reduce generalization error, derived from self-influence. "the natural is the self-influence metric"
- Signal channel: The range of the cumulative dissipation; directions along which training dissipated loss and that affect test predictions. "The signal channel is "
- SNR preconditioner: A signal-to-noise-ratio-based per-parameter scaling that prefers coherent signal over noisy directions. "This objective reduces in practice to an SNR preconditioner on top of Adam"
- Sobolev refinement: A smoothness-based bound tightening the bias term via Sobolev-space regularity. "The Sobolev refinement "
- Spectral decay: The rate at which kernel or covariance eigenvalues decrease, affecting interpolation generalization. "interpolation can be statistically harmless under appropriate spectral decay"
- Spectral projector: An operator projecting onto eigen-subspaces corresponding to specified eigenvalue ranges. "its spectral projectors (derivation from output dynamics in \autoref{sec:operator_derivation}) are"
- Test-invisible reservoir: Reservoir directions that cannot affect any test prediction because the test transfer annihilates them. "near-zero eigenvalues trap residual error in a test-invisible reservoir."
- Test transfer operator: The operator mapping training output gradients to test-set output displacements over a window. "the test transfer operator is"
- Test visibility spectrum: The spectrum of the test-side operator governing which directions are visible at test relative to dissipation. "The test visibility spectrum is strictly bounded by cumulative dissipation "
- Train–test coupling: The exact linear relation showing test displacement is determined by training displacement (under squared loss) along the realized path. "\begin{theorem}[Train-test coupling]\label{thm:train_test_coupling}"
- Uniform convergence: A worst-case generalization framework bounding deviation between empirical and population risks uniformly over hypotheses. "Uniform-convergence bounds, whether expressed in terms of VC dimension ... are vacuous at practical scale"
- VC dimension: A combinatorial capacity measure: the size of the largest set of points a hypothesis class can shatter. "VC dimension \citep{vapnik1971uniform}"
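
Several of the entries above (tangent kernel, frozen-kernel limit, signal channel, reservoir, spectral projector) can be seen together in a small numerical sketch. This is only an illustration under simplifying assumptions, not the paper's construction: it uses a random Jacobian, squared loss, and a constant kernel, and it splits on the eigenvalues of $K = J J^\top$ rather than on the paper's cumulative dissipation operator. The function names and the tolerance `tol` are hypothetical choices made for this example.

```python
import numpy as np

def tangent_kernel(J):
    """Empirical tangent kernel K = J J^T for an (n_outputs, n_params) Jacobian."""
    return J @ J.T

def split_channels(K, tol=1e-8):
    """Spectral projectors of a PSD kernel: 'signal' (eigenvalue > tol) and
    'reservoir' (eigenvalue <= tol). The threshold tol is an assumption."""
    evals, evecs = np.linalg.eigh(K)
    signal = evecs[:, evals > tol]
    reservoir = evecs[:, evals <= tol]
    return signal @ signal.T, reservoir @ reservoir.T, evals

def frozen_kernel_residual(K, r0, t):
    """Residual under gradient flow on squared loss with a constant kernel:
    r(t) = exp(-K t) r0, evaluated through the eigendecomposition of K."""
    evals, evecs = np.linalg.eigh(K)
    return evecs @ (np.exp(-evals * t) * (evecs.T @ r0))

rng = np.random.default_rng(0)
n, p = 6, 3                      # 6 training outputs, 3 parameters, so rank(K) <= 3
J = rng.standard_normal((n, p))  # stand-in for a network's parameter Jacobian
K = tangent_kernel(J)
P_signal, P_reservoir, evals = split_channels(K)

r0 = rng.standard_normal(n)      # initial residual f(x) - y
r_late = frozen_kernel_residual(K, r0, t=1e3)

print("signal rank:", int(round(np.trace(P_signal))))
print("residual left in signal channel:", np.linalg.norm(P_signal @ r_late))
print("residual left in reservoir:", np.linalg.norm(P_reservoir @ r_late))
```

In this toy setting the residual components along nonzero eigenvalues decay exponentially, while the components in the null space of the kernel stay where they started, which is the sense in which error is "trapped" in the reservoir.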