Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

Published 8 May 2026 in cond-mat.dis-nn, cs.AI, and stat.ML | (2605.07870v1)

Abstract: We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/$μ$P scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $μ$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with small output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that the edge of the spectrum still converges for sufficiently wide networks.

Authors (3)

Summary

The paper introduces a two-level DMFT framework that tracks the emergence and dynamics of spectral outliers in deep networks.
Empirical analyses in both infinite-width nonlinear and proportional high-dimensional linear regimes confirm the framework’s precise predictions.
Insights into hyperparameter transfer and the breakdown of the bulk-plus-outlier paradigm suggest new directions for optimization in extensive-output settings.

Spectral Dynamics in Deep Networks: Theoretical Frameworks and Empirical Regimes

Introduction

This work presents a unified theoretical and empirical investigation of the spectral dynamics of hidden weight matrices in deep neural networks during training. The authors develop a two-level dynamical mean-field theory (DMFT) for wide neural architectures that quantifies the joint evolution of the random-matrix “bulk” and the isolated “outlier” singular values. The theory is applied to infinite-width nonlinear networks (mean-field/$\mu$P scaling) and to deep linear networks in the proportional high-dimensional limit. The role and limitations of the bulk-plus-outlier paradigm are examined, particularly for hyperparameter transfer and for regimes with extensive output dimension.

Two-Level DMFT for Spectral Dynamics

The central methodological contribution is a two-level DMFT formalism, capable of tracking spectral outliers emerging from statistically dependent, training-generated finite-rank perturbations, rather than the traditional case of planted, initialization-independent spikes as in classical BBP theory. Specifically, this framework derives a closure for the singular value distribution of random matrices with low-rank, causally dependent “spike” structure. The formalism is presented in sufficient generality to recover the deterministic macroscopic limits for both deep linear and nonlinear architectures trained under various scaling regimes.
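
For reference, the classical planted-spike case mentioned above can be simulated directly. The sketch below is not the paper's two-level DMFT: it assumes a spike that is independent of the bulk, and it uses the standard rank-one prediction for rectangular Gaussian matrices (bulk edge $1+\sqrt{c}$, detachment threshold $\theta > c^{1/4}$, with $c = n/m$). The function names, matrix sizes, and spike strengths are illustrative assumptions.

```python
import numpy as np

def top_singular_value(theta, n=400, m=800, seed=0):
    """Largest singular value of a Gaussian bulk plus an independent rank-one spike."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, m)) / np.sqrt(m)   # bulk with entry variance 1/m
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)
    v = rng.normal(size=m)
    v /= np.linalg.norm(v)
    return np.linalg.svd(X + theta * np.outer(u, v), compute_uv=False)[0]

def bbp_prediction(theta, c):
    """Classical large-size limit of the top singular value, with c = n/m <= 1."""
    if theta <= c ** 0.25:
        return 1.0 + np.sqrt(c)                # spike too weak: stuck at the bulk edge
    return np.sqrt((1.0 + theta ** 2) * (c + theta ** 2)) / theta

c = 400 / 800
for theta in (0.3, 2.0, 3.0):
    print(theta, top_singular_value(theta), bbp_prediction(theta, c))
```

Below the threshold the top singular value sits at the bulk edge; above it, an isolated outlier appears at the predicted location, up to finite-size fluctuations.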

Notably, the theory yields a finite-dimensional determinant criterion for outlier emergence: isolated singular values correspond to the zero locus of a matrix-valued function whose coefficients are the response and correlation functions of the associated DMFT. The model predicts the timing, scaling, and robustness of BBP-like transitions, and the dynamics of corresponding outlier modes throughout training.
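
The idea of a finite-dimensional determinant criterion can be illustrated with an exact linear-algebra identity. The sketch below is an analogy, not the paper's matrix-valued function (whose coefficients are DMFT response and correlation functions). For a symmetric matrix $X$ plus a rank-$S$ term $U\,\mathrm{diag}(\theta)\,U^\top$, the matrix determinant lemma says that eigenvalues outside the spectrum of $X$ are zeros of the $S \times S$ determinant $\det\big(I - \mathrm{diag}(\theta)\, U^\top (zI - X)^{-1} U\big)$. The function name, sizes, and spike strengths are assumptions made for illustration.

```python
import numpy as np

def outliers_from_determinant(X, U, theta, n_grid=2000, n_bisect=80):
    """Eigenvalues of X + U diag(theta) U^T above the top eigenvalue of X,
    located as zeros of f(z) = det(I - diag(theta) U^T (zI - X)^{-1} U).
    Assumes U has orthonormal columns and theta >= 0."""
    lam, V = np.linalg.eigh(X)               # X = V diag(lam) V^T
    P = V.T @ U                              # spike directions in X's eigenbasis
    S = U.shape[1]

    def f(z):
        G = P.T @ (P / (z - lam)[:, None])   # S x S matrix U^T (zI - X)^{-1} U
        return np.linalg.det(np.eye(S) - np.diag(theta) @ G)

    lo = lam[-1] + 1e-9
    hi = lam[-1] + np.max(theta) + 1.0       # Weyl bound: nothing lies above lam_max + max(theta)
    grid = np.linspace(lo, hi, n_grid)
    vals = np.array([f(z) for z in grid])
    roots = []
    for a, b, fa, fb in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
        if fa * fb < 0:                      # a sign change brackets a zero of f
            for _ in range(n_bisect):
                mid = 0.5 * (a + b)
                fm = f(mid)
                if fa * fm <= 0:
                    b = mid
                else:
                    a, fa = mid, fm
            roots.append(0.5 * (a + b))
    return np.array(roots)

rng = np.random.default_rng(0)
N, S = 300, 2
A = rng.normal(size=(N, N)) / np.sqrt(N)
X = (A + A.T) / np.sqrt(2)                   # symmetric Gaussian bulk, spectrum near [-2, 2]
U, _ = np.linalg.qr(rng.normal(size=(N, S))) # orthonormal, independently planted directions
theta = np.array([3.0, 2.0])

roots = outliers_from_determinant(X, U, theta)
eigs = np.linalg.eigvalsh(X + U @ np.diag(theta) @ U.T)
print(roots, eigs[-S:])
```

The zeros found from the small determinant coincide with the top eigenvalues of the full matrix, which is why an $S \times S$ criterion is enough to locate outliers once the bulk resolvent is known.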

Infinite-Width Feature-Learning Regime

Under mean-field/$\mu$P scaling, infinite-width neural networks continue to exhibit nontrivial feature learning, in sharp contrast to the lazy/NTK regime, where hidden features remain essentially frozen. In this super-wide limit (width $N \to \infty$ at fixed input dimension, number of training steps, and batch size), the theory characterizes the evolution of hidden weight spectra. Training induces a finite number of outlier singular values, which detach from the random bulk at BBP-like thresholds governed by the network’s output scaling parameter $\gamma$, the relevant measure of “richness”.
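
The contrast between frozen and moving hidden features can be seen in a one-step toy calculation. The sketch below uses a common textbook convention for a two-layer tanh network, which is an assumption here and not necessarily the paper's parameterization with richness $\gamma$: the NTK variant has readout scale $1/\sqrt{N}$ with a width-independent learning rate, and the mean-field variant has readout scale $1/N$ with the learning rate multiplied by $N$. All names and constants are hypothetical.

```python
import numpy as np

def preactivation_shift(N, parameterization, D=32, eta0=0.1, y=5.0, seed=0):
    """RMS change of hidden preactivations after one gradient step on one example."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=D)
    x *= np.sqrt(D) / np.linalg.norm(x)      # normalize so that |x|^2 / D = 1
    W = rng.normal(size=(N, D))              # hidden weights with O(1) entries
    a = rng.normal(size=N)                   # readout weights
    if parameterization == "ntk":
        alpha, eta = 1.0 / np.sqrt(N), eta0  # f = a . tanh(h) / sqrt(N)
    elif parameterization == "mean_field":
        alpha, eta = 1.0 / N, eta0 * N       # f = a . tanh(h) / N, learning rate scaled by N
    else:
        raise ValueError(parameterization)
    h = W @ x / np.sqrt(D)
    f = alpha * a @ np.tanh(h)
    # Gradient of 0.5 * (f - y)^2 with respect to W, one row per hidden unit
    grad_W = (f - y) * alpha * (a * (1.0 - np.tanh(h) ** 2))[:, None] * x[None, :] / np.sqrt(D)
    h_new = (W - eta * grad_W) @ x / np.sqrt(D)
    return np.sqrt(np.mean((h_new - h) ** 2))

for N in (256, 16384):
    print(N, preactivation_shift(N, "ntk"), preactivation_shift(N, "mean_field"))
```

Under these assumptions the NTK shift shrinks roughly like $1/\sqrt{N}$, while the mean-field shift stays of the same order as the width grows.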

Empirically, the finite-outlier DMFT accurately predicts both the time course and width stability of outlier dynamics for wide convolutional ResNets trained on small-channel tasks such as CIFAR-10, quantitatively matching the observed spectra. This confirms that the bulk-plus-outlier structure accurately describes these regimes, supporting theoretical analysis of stability, mode sharpening, and optimal scaling of hyperparameters.

Proportional High-Dimensional Linear Networks

For deep linear architectures, the proportional limit (width, input dimension, and number of samples diverge at fixed ratios) retains nontrivial width dependence through the high-dimensional scaling laws. The authors apply the DMFT-derived theory to obtain exact, width-resolved predictions for spectral outlier evolution and BBP transition times. The results support the following findings:

In $\mu$P parameterization, both the BBP threshold and post-transition outlier dynamics are nearly width-invariant, enabling successful and robust hyperparameter transfer.
In contrast, standard NTK parameterization exhibits pronounced width dependence: outlier detachment and leading kernel eigenvalue trajectories fail to align across changing network sizes, undermining transferability.
The theory captures how, after the BBP transition, the leading NTK mode sharpens toward the empirical edge of stability (EoS), which dynamically controls the maximal stable learning rate.
These predictions are backed by strong quantitative agreement with empirical training curves and spectral measurements.
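
The link between a leading curvature mode and the maximal stable learning rate can be checked on a quadratic loss, where gradient descent is stable only for learning rates below $2/\lambda_{\max}$. The sketch below is a stand-in for a network, not the paper's setup: the `curvatures` array plays the role of the kernel or Hessian eigenvalues, and its values are hypothetical.

```python
import numpy as np

def run_gd(curvatures, lr, steps=200):
    """Gradient descent on L(w) = 0.5 * sum_i curvatures[i] * w[i]^2; returns the final loss."""
    w = np.ones_like(curvatures)
    for _ in range(steps):
        w = w - lr * curvatures * w          # gradient of L is curvatures * w
    return 0.5 * np.sum(curvatures * w ** 2)

curvatures = np.array([0.1, 1.0, 4.0])       # hypothetical eigenvalues; the top mode sets stability
lr_edge = 2.0 / curvatures.max()             # stability threshold for the top mode

loss_below = run_gd(curvatures, 0.95 * lr_edge)
loss_above = run_gd(curvatures, 1.05 * lr_edge)
print(loss_below, loss_above)
```

Just below the threshold the loss decays; just above it the top mode oscillates with growing amplitude. If the top eigenvalue follows the same trajectory across widths, the same threshold, and hence the same learning rate, carries over.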

Extensive Output Regime and Bulk Restructuring

The authors address regimes where the number of output channels (classes, vocabulary size, etc.) scales proportionally to the width or input dimension, as in large-scale language modeling or ImageNet classification. Empirically, such settings show dramatic restructuring of the entire spectral bulk during training; isolated outliers alone are insufficient to describe the phenomenology.

A corresponding extension of the DMFT formalism is derived for linear networks with extensive output dimension. Here, the finite-rank spike picture collapses: the update term contributes $O(1)$ modifications to the spectral density, producing non-Marchenko–Pastur bulks. In this regime, the top spectral edge can still converge to a width-stable plateau, but whether it does depends on how width scales relative to the output dimension and the learnable signal strength.
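
The difference between a finite-rank and an extensive-rank update can be seen in a small random-matrix experiment. The sketch below is not the paper's toy model: it adds a generic random update of chosen rank to a Gaussian bulk and counts singular values above the Marchenko–Pastur edge. The function names, sizes, `strength`, and the 5% safety `margin` are illustrative assumptions.

```python
import numpy as np

def count_above_mp_edge(W, margin=1.05):
    """Number of singular values of W above the Marchenko-Pastur edge 1 + sqrt(n/m),
    assuming W is n x m (n <= m) with bulk entries of variance 1/m."""
    n, m = W.shape
    edge = 1.0 + np.sqrt(n / m)
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > margin * edge))

def perturbed_weights(n, m, rank, strength, rng):
    """Gaussian bulk plus a random rank-`rank` update with singular values of order `strength`."""
    W0 = rng.normal(size=(n, m)) / np.sqrt(m)
    A = rng.normal(size=(n, rank))
    B = rng.normal(size=(m, rank))
    return W0 + strength * (A @ B.T) / np.sqrt(n * m)

rng = np.random.default_rng(0)
n, m = 200, 400
few = count_above_mp_edge(perturbed_weights(n, m, rank=2, strength=3.0, rng=rng))
many = count_above_mp_edge(perturbed_weights(n, m, rank=n // 2, strength=3.0, rng=rng))
print(few, many)
```

With a rank-2 update only the two planted directions escape the bulk. With a rank that grows in proportion to the width, a macroscopic fraction of the spectrum moves, so counting isolated outliers is no longer a meaningful description.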

Empirical analysis of ImageNet-trained ResNet18 weights and next-word prediction transformers demonstrates that the theoretical spectral restructuring is realized in practical settings, with extensive redistribution of singular values away from the Marchenko–Pastur prediction.

Implications for Hyperparameter Transfer and Learning Theory

The theoretical framework formulated in this work has significant implications both for practical optimization of modern deep networks and for core questions in statistical learning theory:

In width-stable regimes (mean-field/$\mu$P nonlinear and proportional-width linear networks), the bulk-plus-outlier picture predicts the success and failure modes of hyperparameter transfer, especially learning rate scaling. The maximal stable learning rate tracks the top outlier mode and is nearly width-independent under proper scaling.
In extensive-output settings, uniform hyperparameter transfer requires not only outlier alignment but also consistent spectral bulk restructuring—a fundamentally more challenging requirement, motivating new directions in optimization theory.
The connection to the EoS literature is formalized: spectral sharpness and escaping modes are realized as outlier dynamics within a unified dynamical framework.

Strong Claims and Contradictory Phenomena

The paper demonstrates analytically and empirically that NTK parameterization fails to produce width-consistent outlier dynamics even when the large-width limit exists for loss and generalization, in contrast with traditional kernel-theory expectations.
It is shown that for tasks with extensive output dimension, the finite-outlier paradigm is insufficient: the empirical spectral density cannot be captured by a bulk-plus-outlier ansatz, but rather requires a restructuring of the spectral bulk itself.

Limitations and Future Directions

The current DMFT approach is limited to super-wide nonlinear networks and proportionally wide linear networks, with either a finite number of outputs or, in the extensive-output case, a single gradient step. Extending the analysis to multi-step dynamics in the extensive-output regime or to nonlinear models with evolving high-rank structure remains an open challenge, as does direct application to very deep or recurrent architectures.

Future directions include extension to settings with correlated or structured data, deeper architectural features (e.g., transformers, normalization), and non-Gaussian or heavy-tailed initialization statistics. The two-level DMFT approach is broadly extensible to eigenvalue spectra and non-symmetric matrices, with implications for both feedforward and recurrent architectures.

Conclusion

This paper constructs a unifying framework for analyzing the spectral evolution of large and deep neural networks under different width and output scaling limits. It provides exact, self-consistent predictions for the dynamics of both bulk and outlier modes, connecting theoretical matrix models to empirical phenomena in both small- and large-output regimes. The results supply mechanistic underpinnings for hyperparameter transfer and learning rate scaling, while identifying the breakdown points of the bulk-plus-outlier model. The methods and findings frame open questions on the nature of learning in large, feature-learning models, particularly as the community advances toward even larger architectures with highly extensive output spaces.

Explain it Like I'm 14

Spectral Dynamics in Deep Networks: A Simple Explanation

1) What is this paper about?

This paper studies how the “shape” of a neural network’s weights changes during training, especially in very wide (large) networks. The authors look at the spectrum of a weight matrix—the list of its singular values. Most of these values form a big “cloud” (called the bulk), but a few “stand-out” values (called outliers) can pop out as the network learns. The paper builds a new theory to predict when and how those outliers appear and grow, and it uses this to explain when a single learning rate can work well across different network sizes.

Think of it like listening to an orchestra:

The bulk is the background sound from many instruments playing together.
The outliers are the soloists who stand out above the rest.

This paper explains when the soloists appear, how loud they get, and what that means for training.

2) What questions does the paper try to answer?

In simple terms, the paper asks:

As we train a deep network, when do special patterns (outliers) in the weights appear, and how do they grow?
How does this depend on the network’s width (number of neurons), the learning rate, how we scale the network, and the output size?
Why do some ways of setting up a network (like µP, pronounced “mu-P,” a specific way to scale parameters) let you use almost the same learning rate across different widths, while other setups (like NTK parameterization) do not?
For tasks with many outputs (like ImageNet with 1000 classes or LLMs with huge vocabularies), does learning create only a few “soloists,” or does it reshape the whole orchestra?

3) How did they study it? (Methods explained with analogies)

The authors build a two-level mathematical framework:

Level 1: Training dynamics (how learning creates patterns)
- As the network trains, weight updates create “spikes” (special directions) in the weight matrix. But these spikes are not independent from the original random weights—they’re shaped by them. This makes standard math tools hard to use.
Level 2: Spectral probe (how to detect soloists)
- They design a test that scans the spectrum and tells you whether a “soloist” (outlier) has emerged from the “orchestra” (bulk), similar to testing if a note is loud enough to be heard above the noise.

Together, these two levels make up the paper's two-level dynamical mean-field theory (DMFT). You can think of DMFT like a weather forecast for a huge crowd: instead of tracking each person, you track averages and how they respond. That lets you predict when “special patterns” (like a wave in a stadium) form and stand out.

They apply this to two cases:

Infinite-width nonlinear networks under µP scaling (a setup that keeps feature learning alive even for huge widths).
Deep linear networks where width, input size, and dataset size all grow together (a realistic setting to study what changes with width).

They also compare two parameterizations:

µP (mean-field scaling): designed so training behavior stays similar as width changes.
NTK parameterization: often leads to more “lazy” training (less feature learning) as width grows.

4) What did they find, and why does it matter?

Here are the main takeaways, stated simply:

Bulk + Outliers picture (few soloists)
- For many “small-output” problems (like CIFAR-10 image classification), training creates a few strong outliers while the bulk stays mostly the same. This matches the BBP transition from random matrix theory: when signal gets strong enough, a few values pop out from the bulk.
- Their theory predicts when these outliers escape the bulk and how they grow over time, with learning rate, with width, and with initialization scale.
µP makes learning-rate transfer work across widths
- In deep linear networks, µP scaling makes the top “sharpness” (related to the largest eigenvalue of the NTK, a matrix that controls how quickly loss can change) grow in a way that is almost the same across different widths. This helps explain why the same learning rate works well when you increase the width.
- In contrast, NTK parameterization shows strong width dependence: the top mode evolves differently for different widths, so you can’t easily reuse the same learning rate.
When outputs are “huge,” the whole orchestra changes
- For tasks with many outputs (like ImageNet or GPT-style LLMs), training doesn’t just create a few soloists. It reshapes the whole spectrum—the bulk itself moves and changes shape.
- They build a simple linear “toy model” with many output channels to show how and why this happens. Even then, for wide enough networks, the very edge of the spectrum (the largest values) still settles to a predictable limit.
Practical evidence
- In µP CNNs on CIFAR-10: the bulk stays stable; a few outliers detach, just as the theory predicts.
- In ImageNet and LLMs: the bulk deforms and develops a tail—showing we’ve moved beyond the “few soloists” story.

Why this matters:

It connects a visible, measurable signal (the top eigenvalue or outlier) to training stability and learning-rate choice.
It explains when hyperparameters (like learning rate) can be reliably transferred across model sizes—one reason µP is useful in practice.
It offers a way to predict training behavior without having to run huge experiments.

5) What are the implications?

Better, more predictable scaling: The results help explain why µP often lets you use almost the same learning rate as you scale your model up—saving time and compute.
New tools for diagnostics: Watching the weight spectrum (bulk and outliers) can tell you if your model is learning features or just behaving lazily.
Guidance for big-output tasks: If your task has many outputs (like language modeling), expect the whole spectrum to change, not just a few outliers. The toy model provides a starting point for predicting this.
Theory that matches practice: The two-level DMFT framework handles realistic dependence between learned structure and random initialization—closing a gap in previous theory and helping us understand how deep networks truly form representations.

In short: the paper gives a clear, physics-inspired way to “listen” to a network as it learns, identify when and how the “soloists” appear, and use that to choose hyperparameters that work across sizes—while also explaining when the whole orchestra changes its sound.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list summarizes what remains missing, uncertain, or unexplored, framed as concrete directions future researchers could pursue:

Formalize and rigorously prove Result 1 (bulk+outlier characterization with statistically coupled spikes), including precise conditions under which the dependence of spikes on the random initialization still yields the stated outlier criterion and bounds on finite-width errors.
Develop computationally efficient algorithms to find the outlier set O_S by solving det A(z)=0 at scale, reducing the current O(S^3) cost and enabling online BBP-time detection during training.
Extend the two-level DMFT from super-wide nonlinear networks to proportionally wide nonlinear networks (finite D/N and B/N), capturing realistic multi-step SGD dynamics and dataset-sized regimes.
Incorporate common training components—momentum, Adam, weight decay, normalization layers (BatchNorm/LayerNorm), dropout, data augmentation—into the DMFT and characterize their impact on bulk/outlier evolution and learning-rate transfer.
Move beyond random or single-index targets to structured, non-Gaussian data and realistic label noise; quantify how data correlations and distributional shifts modulate spectral dynamics.
Develop a full multi-step, multi-layer DMFT in the extensive-output regime (C ~ N) that tracks correlations built across layers and time, provides conditions for bulk reshaping vs finite-rank spikes, and yields predictive spectral dynamics throughout training rather than at one GD step.
Predict and validate the full spectral density in extensive-output settings beyond the first step, with end-to-end evaluations on ImageNet-scale CNNs and transformer LLMs across layers and training stages.
Explain heavy-tailed spectra observed in trained networks within this framework; derive when and why heavy tails emerge, how they interact with outliers, and how parameters/optimizers influence tail formation.
Derive analytic expressions for BBP times in nonlinear networks as functions of richness γ, initialization variance σ^2, depth L, batch size B, and data statistics; compare with empirical BBP-time measurements.
Provide closed-form or tractable approximations for top outlier locations and NTK eigenvalue trajectories in deep linear and nonlinear networks to enable fast learning-rate selection and EoS diagnostics.
Establish a rigorous link between top NTK eigenvalue dynamics and the maximal stable learning rate under μP across architectures, and quantify deviations under NTK-P and other parameterizations.
Identify sufficient conditions for width-stable learning-rate transfer in μP when losses vary across widths; characterize spike dynamics and EoS proximity needed to ensure transferability.
Quantify finite-width corrections: bound discrepancies between predicted and empirical spectra at widths N≈10^2–10^3 and determine when finite-N effects invalidate the bulk+outlier description.
Generalize the method from singular values to eigenvalue spectra of non-symmetric matrices and recurrent networks; analyze dynamical stability, chaos, and EoS via two-level DMFT.
Integrate explicit regularization (weight decay, dropout), norm constraints, and sparsity-inducing mechanisms; determine their effects on outlier emergence, bulk reshaping, and learning-rate transfer.
Create practical procedures to estimate DMFT correlation and response functions (C and R) directly from training traces or small probes in real networks, bypassing full theoretical solves.
Systematically test the framework across architectures (CNNs, transformers), layers, tasks (vision, language), and output sizes; map when spectra remain near MP vs deform, and quantify outlier counts and trajectories.
Compare initialization schemes (non-Gaussian, orthogonal, scaled variants) and parameterizations (μP, NTK-P, CompleteP, others) to chart how they alter spectral dynamics, BBP thresholds, and optimizer stability.
Derive a “phase diagram” in (C/N, γ, σ^2, depth, batch size) that predicts whether training yields finite outliers or bulk restructuring, including explicit transition conditions and scaling laws.
Design spike-aware training strategies (e.g., regularizing or amplifying dominant modes) guided by DMFT predictions; evaluate whether controlling spike dynamics improves optimization speed or generalization.
Analyze stochasticity: quantify how mini-batch noise and data shuffling impact response functions, outlier dynamics, and variance around predicted trajectories; characterize noise-induced shifts in BBP time.
Provide convergence guarantees and robust numerical recipes for solving det A(z)=0 with multiple close outliers and near-edge cases; assess sensitivity to numerical tolerances and spectral estimation procedures.
Extend the theory to convolutional weights with sharing and to attention matrices with softmax constraints; adapt DMFT assumptions to parameter tying and nonlinearity-specific couplings.
Theoretically explain the observed non-trivial loss scaling exponent vs number of heads in LLMs; connect scaling behavior to spectral restructuring and extensive-output effects.
Validate “edge-of-spectrum convergence for sufficiently wide networks” in extensive-output regimes: derive quantitative width requirements as functions of χ=C/D and γ_0, and provide convergence rates.
Generalize the extensive-output toy model to multi-step SGD with momentum and to deep nonlinear networks; demonstrate predictive accuracy for bulk shifts, tails, and top-mode sharpening over training.
Assess how normalization layers (BatchNorm/LayerNorm) modify the statistical coupling of spikes to initialization (χ, ξ dependence) and whether the Feature Structural Condition remains valid.
Formalize the causal dependence of spikes on {χ, ξ} and identify scenarios where causality may be violated (e.g., temporal weight sharing in RNNs); determine implications for outlier prediction.
Determine practical criteria for selecting the number of spike directions S needed to capture observed outliers; develop truncation/error bounds to keep A(z) computations tractable without losing accuracy.
Link spectral outliers to learned task-aligned features: devise methods to extract, interpret, and validate spike directions against target structure, and quantify their contribution to generalization.

Practical Applications

Immediate Applications

The following opportunities can be deployed now with reasonable engineering effort, drawing directly from the paper’s findings on bulk+outlier spectral dynamics, BBP transitions, and μP-driven width-consistent behavior.

Hyperparameter transfer across widths using μP
- Sectors: software, AI/ML industry, education, robotics
- What: Adopt μP/CompleteP parameterizations to obtain width-stable learning-rate (LR) and training dynamics; reduce retuning when scaling model width.
- Tools/workflows: Parameterization libraries (μP/CompleteP), width-scaling playbooks, automated scripts to copy hyperparameters across model sizes.
- Assumptions/dependencies: Requires correct μP implementation and consistent initialization; stability claims rely on infinite-width tendencies but perform well in practice for wide models.
Learning-rate selection via edge-of-stability (EoS) using top NTK mode
- Sectors: software, AI/ML industry, robotics, energy
- What: Estimate the top NTK eigenvalue during training and choose an LR that keeps dynamics near—but below—the EoS; accelerates convergence with fewer instabilities.
- Tools/workflows: Online power iteration on empirical NTK/Gauss–Newton/gradient-covariance approximations; training dashboards to visualize λ_max trends; LR controllers with safety margins.
- Assumptions/dependencies: Approximating the NTK is computationally costly; quality of estimates depends on batch noise and layer selection; EoS is a practical heuristic, not a guarantee of best generalization.
Training stability monitoring via bulk+outlier spectra
- Sectors: software, AI/ML industry, finance (model risk), safety
- What: Track when outliers detach from the Marchenko–Pastur (MP) bulk (BBP time) to detect onset of feature learning, instability, or regime shifts.
- Tools/workflows: Periodic randomized SVD/power iteration on layer weight covariances; TensorBoard-like spectral panels; alerts when bulk shifts or outliers surge.
- Assumptions/dependencies: Feasible for selected layers; full-spectrum computation is heavy for convs/transformers; interpretability best for tasks with few outputs.
Parameterization choice and initialization tuning
- Sectors: software, AI/ML industry, education
- What: Prefer μP/CompleteP and adjust output multipliers and initialization variance to encourage rich feature learning while preserving width consistency.
- Tools/workflows: Layerwise scaling templates, initializer presets, μP configuration checkers.
- Assumptions/dependencies: Benefits strongest in wide regimes; mis-specified scaling can negate transfer gains.
Low-rank adaptation and compression guided by spectral outliers
- Sectors: software, mobile/edge, robotics
- What: Use the number and strength of outliers to set LoRA ranks and identify compressible subspaces; prune or factorize along spike directions.
- Tools/workflows: Outlier-count heuristics for rank selection; integration with LoRA/adapter libraries.
- Assumptions/dependencies: Finite-outlier picture fits tasks with few outputs (e.g., small-class vision, regression); extensive-output tasks may require different heuristics.
Task regime diagnostics: finite-outlier vs bulk restructuring
- Sectors: software, AI/ML industry (vision, language)
- What: For small-output tasks (e.g., CIFAR-10 CNNs), expect finite spikes with stable MP bulk; for large-output tasks (ImageNet, LMs), expect bulk reshaping—adjust monitoring and interpretation accordingly.
- Tools/workflows: Task-specific spectral dashboards; switch between “outlier monitor” and “bulk-shape monitor.”
- Assumptions/dependencies: Based on empirical observation and toy-model analysis; full theory for extensive outputs beyond one step is pending.
Reproducibility and reporting standards that include spectral diagnostics
- Sectors: academia, policy (best practices), AI/ML industry
- What: Report parameterization (μP/NTK-P), output multipliers, and spectral diagnostics; standardize how scaling experiments document stability and transfer.
- Tools/workflows: Experiment templates; CI checks for scaling and spectral logs.
- Assumptions/dependencies: Community adoption; adds overhead to training pipelines.
Compute and energy savings through reduced retuning
- Sectors: energy, AI/ML industry, sustainability
- What: Fewer grid searches due to width-stable LR and μP scaling; shorter path to stable training configurations.
- Tools/workflows: μP-based scaling recipes; LR setters using λ_max estimates.
- Assumptions/dependencies: Savings realized when teams adopt μP consistently; NTK-based estimation adds some compute but typically less than exhaustive tuning.
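
As one way to make the monitoring ideas above concrete, the following sketch estimates a layer's top singular value by power iteration, without a full SVD, and compares it to the Marchenko–Pastur edge of a purely random matrix. It assumes the bulk entries are i.i.d. with a known standard deviation `sigma` (for example, the initialization scale), which may not hold in a trained network. The function names and the 5% `margin` are hypothetical choices.

```python
import numpy as np

def top_singular_value_power(W, n_iter=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = W.T @ (W @ v)                    # one application of W^T W
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)             # never exceeds the true top singular value

def has_outlier(W, sigma, margin=1.05):
    """Flag a detached outlier: top singular value above margin * MP edge.
    Assumes the bulk of W (n x m) has i.i.d. entries with standard deviation sigma."""
    n, m = W.shape
    edge = sigma * (np.sqrt(n) + np.sqrt(m))
    return top_singular_value_power(W) > margin * edge

rng = np.random.default_rng(1)
n, m = 256, 512
sigma = 1.0 / np.sqrt(m)
W0 = rng.normal(size=(n, m)) * sigma         # random initialization, no learned structure
u = rng.normal(size=n)
u /= np.linalg.norm(u)
v = rng.normal(size=m)
v /= np.linalg.norm(v)
W_spiked = W0 + 3.0 * np.outer(u, v)         # hypothetical learned rank-one direction

print(has_outlier(W0, sigma), has_outlier(W_spiked, sigma))
```

Each check costs a few hundred matrix-vector products, so it can be run periodically on selected layers. For tasks with many outputs, where the bulk itself deforms, a single edge comparison like this is likely to be less informative.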

Long-Term Applications

These opportunities require further research or engineering, particularly to scale to large-output regimes, extend theory from one step to full training, or integrate into robust products.

Automated LR controllers that “ride the edge”
- Sectors: software, AI/ML industry, robotics
- What: Optimizers that continuously estimate top NTK mode and adjust LR to remain near EoS across training phases and widths.
- Tools/products: Spectral-aware SGD/Adam variants; control-theory inspired LR schedulers.
- Assumptions/dependencies: Reliable online NTK estimation; robustness to noise and latency; safeguards for non-stationary data.
Spectral-aware optimizers and preconditioners
- Sectors: software, AI/ML industry
- What: Optimizers that shape the spectrum—encourage “useful” outliers or constrain bulk movement—to balance speed and stability.
- Tools/products: Preconditioners using low-rank NTK approximations; layerwise spectral regularizers.
- Assumptions/dependencies: Need validated links between spectral shaping and generalization/robustness; overhead must be controlled.
Design guidance for extensive-output tasks (LMs, large-class vision)
- Sectors: software, education, NLP, vision
- What: Use large-output theory to pick width–vocabulary/class ratios that keep edge eigenvalues stable; plan head counts and layer widths.
- Tools/workflows: Sizing calculators driven by extended DMFT/RMT; design rules-of-thumb for χ=C/D and ν=N/D.
- Assumptions/dependencies: Current theoretical results address one-step effects; full-training extensions needed for strong guarantees.
Scalable spectral toolkits for production deep nets
- Sectors: software, AI platforms
- What: Fast, distributed estimators of top eigenvalues and bulk edges for conv/attention layers; incremental randomized SVD integrated into training loops.
- Tools/products: Libraries in PyTorch/JAX; plugins for training dashboards and profilers.
- Assumptions/dependencies: Engineering for memory/computation efficiency; numerical stability on large models.
Continual/fine-tuning workflows that track and reuse spike subspaces
- Sectors: software, enterprise ML, healthcare, robotics
- What: Persist and update identified outlier subspaces across tasks; guide low-rank adapters and reduce catastrophic forgetting.
- Tools/products: Spectral “subspace checkpoints”; adapter-rank schedulers based on outlier evolution.
- Assumptions/dependencies: Stability of subspaces across domains; privacy constraints for storing subspace data.
Robustness and security monitoring via spectral fingerprints
- Sectors: safety, finance, policy, cybersecurity
- What: Use unusual bulk shifts or abnormal outlier growth as indicators of distribution shift, poisoning, or backdoor behavior.
- Tools/workflows: Spectral anomaly detectors; alert thresholds for λ_max and bulk-edge deviations.
- Assumptions/dependencies: Requires labeled baselines of healthy spectra per model/task; low false-positive rates needed.
Policy and best-practice guidelines for compute-efficient scaling
- Sectors: policy, AI governance, industry
- What: Standardize μP-like parameterizations and spectral diagnostics in scaling reports; promote reproducible, compute-efficient training.
- Tools/workflows: Checklists, documentation standards, audit templates including spectral metrics.
- Assumptions/dependencies: Community and regulator buy-in; clarity on privacy/safety of sharing spectral data.
Domain-specific scaling from small to large systems
- Sectors: healthcare, robotics, embedded/edge
- What: Transfer learning rates and schedules when porting models across device classes or deployment sites (e.g., hospitals); provide stability assurances when resizing models.
- Tools/workflows: μP-based deployment kits; automated width-dependent hyperparameter mapping.
- Assumptions/dependencies: Domain data may deviate from random-data assumptions; validation on real-world distributions required.
Educational and research platforms for spectral learning dynamics
- Sectors: academia, education
- What: Teaching modules and simulators illustrating BBP transitions, bulk+outlier dynamics, and parameterization effects.
- Tools/products: Interactive notebooks; DMFT-based simulators.
- Assumptions/dependencies: Abstraction level must stay accessible; ongoing maintenance.
Open-source implementations of two-level DMFT predictors
- Sectors: academia, software, AI/ML industry
- What: Libraries to predict outlier emergence times and leading eigenvalue dynamics for given parameterizations, datasets (or surrogates), and training schedules.
- Tools/products: Python/JAX packages with APIs for Result 1 (bulk+spike) and extensive-output prototypes.
- Assumptions/dependencies: Theory exact in asymptotic limits; calibration needed for finite-width, real-data settings.

Glossary

Asymptotically free: A property of random matrices whose independent factors behave freely in the large-dimension limit, enabling tractable product spectra. "At initialization, these matrices are statistically independent and asymptotically free~\cite{potters2020first}, so the one-step update is a product of independent random matrices."
BBP phase transition: The threshold at which low-rank “spikes” generate isolated eigenvalues that detach from the random-matrix bulk. "the emergence of isolated eigenvalues from the bulk as signal strength increases is known as the BBP phase transition~\cite{baik2005phase}."
BBP time: The earliest training step when an outlier eigenvalue exits the bulk. "we define the BBP time as the minimum value of $T$ where the set $O_S$ is non-empty (the time when an outlier exits the bulk)."
CompleteP: A feature-learning parameterization designed to preserve comparable dynamics and optimal hyperparameters across widths and depths. "Feature learning parameterizations such as $\mu$P and CompleteP are designed so that training dynamics remain comparable across widths and depths, generating consistent optimal hyperparameters across model sizes \cite{yang2021tensor, yang2022tensor, dey2025don}."
Dynamical mean-field theory (DMFT): A self-consistent theoretical framework that reduces high-dimensional disordered dynamics to single-site stochastic processes with self-averaging statistics. "We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk."
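Edge of stability (EoS): The boundary in learning rate or curvature beyond which training becomes unstable; for full-batch gradient descent with learning rate η, the sharpness (largest Hessian eigenvalue) hovers near 2/η in this regime. "including width-stable growth of the leading NTK mode toward the edge of stability (EoS)."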
Edge of the spectrum: The boundary of the spectral bulk’s support (e.g., the largest bulk eigenvalue). "and show that edge of the spectrum still converges for sufficiently wide networks."
Gaussian Orthogonal Ensemble (GOE): A standard distribution over symmetric random matrices with Gaussian entries used in random matrix theory. "where $\bm M_0$ is a GOE matrix."
Hyperparameter transfer: The phenomenon where optimal hyperparameters (e.g., learning rate) remain stable across changes in model size. "This perspective is especially relevant for questions of width scaling and hyperparameter transfer."
Kernel/lazy regime: A training regime where parameters move minimally and dynamics are well-approximated by linearization, effectively yielding kernel methods. "In the lazy regime, parameter updates remain small and training is well approximated by linearization around initialization, leading to effectively kernel-based dynamics \citep{jacot2018neural, chizat2019lazy}."
Marchenko–Pastur law: The limiting eigenvalue distribution of sample covariance (Wishart) matrices that describes the spectral bulk of wide random weights. "the singular-value bulk of wide random weight matrices follows the Marchenko--Pastur law \citep{marvcenko1967distribution}."
Mean-field / μP scaling: A parameterization that preserves nontrivial feature learning even at infinite width by appropriately scaling outputs and updates. "mean-field / $\mu$P scaling enables strong feature learning even at infinite width \cite{mei2019mean, geiger2020disentangling, yang2021tensor, bordelon2022self}."
Neural scaling laws: Empirical relationships describing how performance scales with model, data, and compute size. "The recent emphasis on neural scaling laws has made it increasingly important to understand which aspects of training are stable across model size and which are not \citep{kaplan2020scaling, hoffmann2022training}."
Neural Tangent Kernel (NTK): The kernel governing dynamics of networks linearized around initialization; its spectrum controls stability and training speed. "leading NTK mode toward the edge of stability (EoS)."
NTK eigenvalue: An eigenvalue of the NTK operator; the largest one often controls the maximal stable learning rate. "The edge-of-stability (EoS) literature~\cite{xing2018walksgd,Jastrzebski2020The,cohen2021gradient,Cohen2022AdaptiveGM,andreyev2025edgestochasticstabilityrevisiting} suggests that the maximum stable learning rate is controlled by sharpness or the NTK eigenvalue~\cite{jiang2026understanding}."
NTK parameterization (NTK-P): A scaling where the NTK remains fixed as width increases, often yielding lazy training dynamics. "In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit."
Outlier (spectral outlier): An isolated eigenvalue/singular value that detaches from the spectral bulk, typically encoding learned low-dimensional structure. "These outliers encode low-dimensional directions shaped by learning, while the bulk continues to reflect high-dimensional randomness and finite-width effects."
Proportional high-dimensional limit: A scaling regime where width, input dimension, and dataset size diverge with fixed ratios, preserving finite-width effects. "deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios."
Proportional scaling: A limit where multiple dimensions grow jointly with fixed aspect ratios, enabling self-averaging analyses. "Assume proportional scaling $N_0, N_1 \to \infty$ with $N_1/N_0 = \alpha$"
Rich regime: The feature-learning regime in which training significantly changes hidden representations and weights. "In this feature-learning regime (often termed the rich regime), training substantially modifies hidden weights and representations rather than merely fitting a linear model on top of static random features \cite{ghorbani2020neural, vyas2022limitations, montanari2025dynamical}."
Spiked ensembles: Random matrices with low-rank structured perturbations (“spikes”) added to a random bulk. "for spiked ensembles whose spike directions remain statistically dependent on the random bulk."
Spiked random matrix theory: The study of random matrices with low-rank perturbations and their outlier eigenvalues, including BBP phenomena. "spiked random matrix theory and the BBP transition describe when low-rank perturbations generate isolated eigenvalues outside a random bulk \citep{baik2005phase, benaych2012singular, baik2006eigenvalues, forner2025bbp}."
Stieltjes transform: A complex-analytic transform of a spectral distribution (resolvent), used to characterize bulk spectra. "where $\mathcal G(z) = \frac{\alpha}{2 z \sigma^2}\left[ z +\sigma^2(\alpha^{-1}-1) - \sqrt{ [z +\sigma^2(\alpha^{-1}-1)]^2 - 4 z \alpha^{-1} \sigma^2 } \right]$ is the Stieltjes transform of the Wishart matrix \cite{potters2020first, atanasov2024scaling}"
Two-level DMFT: An extension of DMFT that simultaneously tracks training dynamics and the spectral probe resolving bulk and outliers. "we extend these methods to a two-level DMFT for tracking singular/eigenvalue evolution in random matrices whose training-induced updates remain statistically coupled with their initial random components."
Wishart matrix: A random sample-covariance matrix whose spectrum is described by the Marchenko–Pastur law in high dimensions. "is the Stieltjes transform of the Wishart matrix \cite{potters2020first, atanasov2024scaling}"
Wigner model: A random symmetric matrix model (e.g., GOE) with semicircular bulk, often used in BBP analyses. "As a warm-up, we consider the standard rank-one spiked Wigner model"
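The BBP and spiked Wigner entries above can be checked numerically with a small experiment. The sketch below assumes the textbook normalization in which the GOE bulk follows the semicircle law on $[-2, 2]$; the paper's own normalization may differ. Under that assumption, a rank-one spike of strength $\theta$ produces an outlier near $\theta + 1/\theta$ when $\theta > 1$, and the top eigenvalue stays at the bulk edge otherwise. The function name `spiked_wigner_top_eig` and the chosen sizes are illustrative.

```python
import numpy as np

def spiked_wigner_top_eig(n, theta, seed=0):
    """Largest eigenvalue of W + theta * v v^T, with W a GOE matrix whose
    bulk edge is near 2 and v a random unit vector."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    w = (g + g.T) / np.sqrt(2 * n)  # symmetric, off-diagonal variance 1/n
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    m = w + theta * np.outer(v, v)
    return float(np.linalg.eigvalsh(m)[-1])  # eigvalsh sorts ascending

n = 500
top_below = spiked_wigner_top_eig(n, theta=0.5)  # sub-threshold spike
top_above = spiked_wigner_top_eig(n, theta=2.0)  # super-threshold spike

print("theta = 0.5:", top_below, "(bulk edge is 2)")
print("theta = 2.0:", top_above, "(prediction theta + 1/theta = 2.5)")
```

At finite $n$ the agreement is only approximate: the outlier fluctuates on a scale of order $n^{-1/2}$, and the bulk edge on a scale of order $n^{-2/3}$.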

