Lecture notes: From Gaussian processes to feature learning
Abstract: These lecture notes develop the theory of learning in deep and recurrent neuronal networks from the point of view of Bayesian inference. The aim is to enable the reader to understand typical computations found in the literature in this field. Initial chapters develop the theoretical tools, such as probabilities, moment and cumulant-generating functions, and some notions of large deviation theory, as far as they are needed to understand collective network behavior with large numbers of parameters. The main part of the notes derives the theory of Bayesian inference for deep and recurrent networks, starting with the neural network Gaussian process (lazy-learning) limit, which is subsequently extended to study feature learning from the point of view of adaptive kernels. The notes also expose the link between the adaptive kernel approach and approaches of kernel rescaling.
Explain it Like I'm 14
Overview
This paper is a set of teaching notes about how and why neural networks learn. The authors use ideas from physics to explain when networks behave in simple, predictable ways and when they truly learn useful “features” from data. A big focus is on two things:
- Gaussian processes: a neat mathematical way to make smooth predictions.
- Feature learning: how networks change their internal “view” of data to get better at tasks.
They show how deep networks (like the ones used for images) and recurrent networks (used for sequences) can be studied in one unified, physics-inspired framework.
Key Objectives
The notes aim to answer clear questions:
- How do neural networks learn from a limited amount of data and then generalize to new examples?
- When do networks do “lazy learning” (barely changing their weights) versus real feature learning (changing internal representations to match the task)?
- How can we describe learning using Bayesian inference (updating beliefs based on data)?
- What happens to learning when networks get very wide (many neurons) and when the size of the training set grows?
- Are there “phase transitions” in learning, like the sudden change when water freezes?
Methods and Approach (explained simply)
The authors build their story step by step, using everyday analogies:
- Bayesian inference: Think of the network’s weights as your “beliefs.” Before seeing data, you have a prior belief about them. After seeing the training data, you update to a posterior belief (your new, informed opinion).
- Posterior over outputs: Instead of tracking every weight, they mostly study what the network outputs look like after training. This is simpler and still very informative.
- Statistical physics tools:
- Partition function and free energy: Imagine scoring all possible settings of the network by how well they fit the data and how complicated they are. Free energy balances “fit” (energy) against “flexibility” or “number of options” (entropy).
- Phase transitions: Like water turning to ice, networks can suddenly switch behavior when you change something (for example, adding enough training examples or changing architecture).
- Law of large numbers and large deviations: With many neurons, averages become predictable (like flipping many coins). The notes use this to simplify complicated network behavior.
- Fokker–Planck equation and Langevin training: Picture a marble rolling in a landscape with a little random shaking. Over time, it settles into regions it prefers. That’s like weights moving under noisy gradient descent; the Fokker–Planck equation describes how their probability distribution changes and eventually settles.
- Gaussian processes (GPs): A GP is a way to predict outcomes that naturally prefers smooth functions. In very wide networks with few training points, the network behaves like a GP (specifically the Neural Network Gaussian Process, or NNGP). This is “lazy learning”—the network barely changes its weights from their starting values.
- Kernels: A kernel is a measure of similarity between inputs. In GP models, kernels control what kinds of patterns you can learn. The notes discuss two ways kernels change during feature learning:
- Scaling view: The kernel mostly keeps its shape but gets scaled (amplified or reduced).
- Adaptive kernel view: The kernel actively reshapes, adapting to the data to represent features better.
- They explain how these two views are connected.
- Deep and recurrent networks: The notes show how both types can be treated similarly in this framework, and they compare their behaviors.
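As a concrete illustration of the GP regression picture above, here is a minimal NumPy sketch; the squared-exponential kernel and its hyperparameters are illustrative stand-ins for a network-derived NNGP kernel, not the notes' construction:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0):
    """Squared-exponential kernel; a stand-in for a network-derived NNGP kernel."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """GP regression: posterior mean and variance of outputs at the test points."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)
    K_ss = rbf_kernel(X_test, X_test)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

# A smooth target: the GP posterior interpolates smoothly between training
# points, illustrating the bias towards smooth functions in the lazy limit.
X = np.linspace(-3, 3, 10)[:, None]
y = np.sin(X).ravel()
mean, var = gp_posterior(X, y, np.array([[0.5]]))
```

The posterior mean interpolates the training targets smoothly, and the posterior variance grows away from the data, which is what makes the GP view useful for uncertainty estimates.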
Along the way, the notes teach important math tools like moments and cumulants (ways to summarize distributions), the Gaussian distribution, and Wick’s theorem (a method to compute averages in Gaussian models).
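Wick's theorem can be checked numerically: for a zero-mean Gaussian the fourth moment is the sum over the three pairwise contractions, 3σ⁴, so the fourth cumulant vanishes. A quick Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
x = rng.normal(0.0, sigma, size=2_000_000)

# Wick's theorem for a zero-mean Gaussian: <x^4> equals the sum over the
# three pairwise contractions, i.e. 3 * <x^2>^2 = 3 * sigma^4.
m2 = np.mean(x**2)
m4 = np.mean(x**4)

# The fourth cumulant kappa_4 = <x^4> - 3<x^2>^2 vanishes for a Gaussian:
# beyond the second cumulant there is no extra information.
kappa4 = m4 - 3.0 * m2**2
```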
Main Findings and Why They Matter
- In the “infinite width, few samples” limit, deep and recurrent networks behave like Gaussian processes (NNGP). This predicts smooth outputs and explains “lazy learning.” It’s easy to analyze and gives basic insights into generalization and biases.
- True feature learning appears when both the network width and the number of training samples grow together (keeping their ratio fixed). In this regime:
- The kernel changes beyond simple scaling: it adapts to the data.
- Learning can undergo phase transitions with respect to data size or architectural choices, marking sharp changes like the “onset” of specialization.
- The notes unify two popular ways of describing feature learning—scaling and adaptive kernel—and show how they relate.
- They link Bayesian posterior analysis to the end state of noisy gradient descent training (Langevin dynamics), using the Fokker–Planck equation. This ties a theoretical, “belief-updating” view directly to how training actually behaves over time.
- The framework recovers known results like the NNGP limit and explains practical ideas such as:
- Smooth function bias: networks tend to prefer smooth solutions.
- Neural scaling laws: losses often decrease via power laws as data or model size increases.
- Critical initialization: placing networks near a “transition to chaos” can help signals and gradients flow well.
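The contrast between the two kernel views can be made concrete with a toy example: rescaling a kernel leaves its eigenvectors, the directions a GP can learn, untouched, while an adapted kernel rotates them. The base kernel, the rank-one "task direction" v, and all coefficients below are illustrative assumptions, not the notes' construction:

```python
import numpy as np

x = np.linspace(0.0, 3.0, 20)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)    # toy base kernel (RBF)

# Scaling view: c*K keeps the eigenvectors; only the eigenvalues rescale.
_, V = np.linalg.eigh(K)
_, V_scaled = np.linalg.eigh(3.0 * K)

# Adaptive view (toy): a task-aligned rank-one term v v^T reshapes the kernel.
v = np.cos(4.0 * x)                                # hypothetical task direction
_, V_adapt = np.linalg.eigh(K + 2.0 * np.outer(v, v))

top, top_scaled, top_adapt = V[:, -1], V_scaled[:, -1], V_adapt[:, -1]
align_scaled = abs(top @ top_scaled)   # ~1: same directions, rescaled weights
align_adapt = abs(top @ top_adapt)     # well below 1: genuinely new directions
```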
Implications and Potential Impact
Understanding neural networks with these physics-inspired tools can:
- Help design better architectures and training strategies before spending huge resources on training, saving time, money, and energy.
- Explain when and how feature learning reduces the number of examples needed to reach good accuracy.
- Predict sharp changes (phase transitions) in learning behavior, guiding choices like data size and model width.
- Provide a common language to compare deep and recurrent networks, and extend ideas to modern architectures like transformers and CNNs.
- Strengthen the bridge between machine learning and physics, giving researchers powerful tools to analyze complex models.
In short, these notes offer a clear path to understanding the “why” behind neural network performance, especially the jump from simple, smooth predictions to rich, task-specific feature learning.
Knowledge Gaps
Limitations and open questions
Below is a focused list of what remains missing, uncertain, or unexplored in the paper, formulated to be actionable for future research.
- Extend the theory from Langevin-trained stationary posteriors to realistic training dynamics (mini-batch SGD, momentum, Adam), including non-equilibrium transients and state-dependent/colored noise; derive a generalized Fokker–Planck description and quantify deviations from the Bayesian stationary posterior.
- Analyze the deterministic gradient-flow/NTK regime within the same field-theoretic framework; characterize feature-learning corrections under gradient flow and establish conditions under which NTK and Bayesian posterior predictions agree or diverge.
- Go beyond the assumption that train and test data are identically distributed; develop a theory for distributional shift, transfer learning, and domain adaptation, quantifying how kernel adaptation and generalization degrade or improve under mismatched distributions.
- Incorporate online, active, few-shot, curriculum, and continual learning into the theoretical framework; model time-varying data/targets and derive phase behavior and generalization dynamics over training time.
- Move beyond vanilla DNNs/RNNs to architecture-specific analyses:
- CNNs: account for locality and weight sharing in GP limits and feature-learning self-consistency; quantify changes to phase transitions and scaling.
- ResNets: include skip connections and depth-wise normalization effects on kernels and feature learning.
- Transformers: model attention (query-key-value statistics), positional encodings, and sequence length scaling; derive GP and kernel-adaptation limits with attention-induced constraints.
- GNNs: incorporate graph structure and message passing into GP/feature-learning theory.
- Link output posterior analysis to internal representation geometry; define measurable observables (e.g., layer-wise manifold curvature, Fisher information, representation alignment) and derive how kernel adaptation reshapes hidden representations.
- Integrate pruning, sparsification, and explicit sparsity penalties (L0/L1) into the Bayesian field-theoretic framework; characterize their impact on kernel adaptation, phase transitions, sample complexity, and generalization.
- Generalize beyond quadratic loss and Gaussian likelihoods to classification settings (cross-entropy, margin-based losses) and non-Gaussian observation models; derive corresponding posteriors and feature-learning corrections, and assess changes in phase behavior.
- Provide systematic finite-width/finite-depth/finite-sample analyses with non-asymptotic error bounds; quantify the accuracy of large-deviation/saddle-point approximations and identify regimes where asymptotic predictions fail.
- Compare parameterizations (standard, NTK, μP) in terms of feature-learning strength and kernel deformation; derive scaling relations for readout and hidden weights that govern the onset and magnitude of feature learning.
- For recurrent networks, extend beyond vanilla RNNs to gated models (LSTM/GRU), long-sequence limits, and non-ergodic inputs; analyze stability, vanishing/exploding gradients, and feature learning under realistic temporal statistics.
- Characterize phase transitions more fully: locate critical points, determine order and critical exponents, study universality classes across architectures/activations/losses, and perform finite-size scaling analyses.
- Move from data-agnostic/Gaussian i.i.d. priors to structured or data-dependent priors (orthogonal, low-rank, heavy-tailed, evidence-maximizing); quantify how prior choice affects GP limits, kernel adaptation, and equivalence to shallow models beyond linear networks.
- Establish existence, uniqueness, and stability of kernel-adaptation fixed points for common nonlinear activations; develop robust numerical schemes with convergence guarantees for solving the self-consistency equations.
- Systematically study how activation functions (ReLU, GELU, tanh, etc.) modulate feature-learning strength, kernel deformation, and phase behavior; derive analytic expressions where possible.
- Empirically validate theory-derived predictions (e.g., generalization-error scaling, phase transitions, feature-learning gains over NNGP) across multiple datasets, tasks, and architectures; design controlled experiments to isolate kernel-adaptation effects.
- Extend to multi-task and continual learning: model shared kernels across tasks, track kernel evolution with task sequences, and analyze catastrophic forgetting and potential phase transitions in multi-task regimes.
- Analyze uncertainty quantification and calibration under feature learning; derive conditions under which output posteriors are well-calibrated and how mismatched priors/likelihoods affect calibration and epistemic uncertainty.
- Incorporate regularization and normalization (weight decay, dropout, batch norm, layer norm) and data augmentation/invariance constraints into the Bayesian/theoretical framework; quantify their impact on kernel adaptation and generalization.
- Address computational scalability of kernel-adaptation: develop efficient approximations (e.g., low-rank/inducing-point methods, stochastic estimation, random features) suitable for large datasets and modern architectures.
- For classification, relate output posteriors to decision boundaries and margins; derive sample-complexity improvements from feature learning in terms of margin distributions and robust generalization guarantees.
- Study out-of-distribution behavior using posterior variance and other uncertainty measures; determine how feature learning affects OOD detection and robustness.
- Unify and generalize analyses of position-information phase transitions in sequence models; connect to different positional encoding schemes and quantify trade-offs between content and positional information encoding.
- Provide quantitative links to neural scaling laws: derive exponents/pre-factors for loss decline with data and model size within the presented framework under realistic data models, and explain deviations observed in practice.
Practical Applications
Overview
These lecture notes develop a Bayesian, statistical-physics-based toolkit for analyzing deep and recurrent neural networks (DNNs, RNNs). Core contributions include:
- A unified derivation of the Neural Network Gaussian Process (NNGP) limit for wide networks and its predictive use.
- Non-perturbative theories of feature learning in the joint limit of network width and dataset size (scaling and kernel-adaptation views) and their unification.
- Phase-transition perspectives on learning as a function of data/architecture, and on sequence processing (e.g., content vs. positional encoding).
- A connection between stochastic gradient-based training (Langevin/SGLD) and Bayesian posteriors via the Fokker–Planck formalism.
- A practical calculational toolbox (moments, cumulants, large deviations) for average-case analysis of neural learning.
Below are practical applications derived from these findings and methods.
Immediate Applications
The following applications can be deployed with current tools and practices, often by integrating Gaussian-process (GP) approximations, initialization theory, and uncertainty estimation into existing ML workflows.
- Architecture initialization at the “edge of chaos”
- Use case: Set weight/bias variances and activation functions to ensure signal and gradient propagation (critical initialization) for deep nets; stabilize training and improve trainability.
- Sectors: Software/AI, robotics, healthcare, finance.
- Tools/workflows: Add an “initialization check” step (compute variance maps or correlation maps across layers); adopt existing formulas to tune variance to the critical point.
- Assumptions/dependencies: Analysis assumes large width; independence/identical distribution (i.i.d.) of weights at init; results strongest for vanilla DNNs/RNNs.
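A minimal version of such an initialization check propagates the activation variance through a random zero-bias tanh network, for which the critical weight standard deviation is σ_w = 1; depth, width, and the probe values of σ_w below are illustrative:

```python
import numpy as np

def final_variance(sigma_w, depth=20, width=1000, seed=0):
    """Propagate one input through a random tanh MLP; return last-layer variance."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        h = np.tanh(W @ h)
    return np.var(h)

# Below the critical point the signal dies out layer by layer;
# above it the variance settles at a finite fixed point.
v_ordered = final_variance(sigma_w=0.5)   # ordered phase: variance -> 0
v_chaotic = final_variance(sigma_w=1.5)   # chaotic side: finite variance
```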
- Trainless model diagnostics via NNGP regression
- Use case: Predict generalization behavior and uncertainty using NNGP/GP surrogates before costly training; compare architectures or activation choices on a dataset to guide selection.
- Sectors: Software/AI platforms, AutoML, education/research.
- Tools/workflows: Integrate GP-based kernels computed from architectures (e.g., Neural Tangents) into model-selection pipelines; run GP regression as a baseline performance estimate.
- Assumptions/dependencies: Infinite-width limit; i.i.d. train/test data; priors match analysis (e.g., Gaussian i.i.d. weights); limited feature-learning effects captured.
- Uncertainty quantification (UQ) via Bayesian posteriors over outputs
- Use case: Attach calibrated predictive variances to neural predictions; set abstention thresholds and risk-aware decision rules.
- Sectors: Healthcare diagnostics, autonomous systems, finance risk management.
- Tools/workflows: Use GP posterior (NNGP) in small-data or early-stage settings; approximate Bayesian posteriors via Langevin/SGLD training; deploy UQ in serving stacks.
- Assumptions/dependencies: Posterior calibration quality depends on prior choice and i.i.d. assumption; SGLD noise/step-size must approximate desired temperature.
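The Langevin/SGLD update behind this workflow is a gradient step plus injected Gaussian noise, θ ← θ − η∇L(θ) + √(2ηT) ξ. On a quadratic loss its stationary distribution is known exactly, a Gaussian of variance T for unit curvature, which gives an end-to-end sanity check (step size, temperature, and loss are illustrative choices):

```python
import numpy as np

def sgld(grad, theta0, eta=1e-2, temperature=0.1, steps=50_000, seed=0):
    """Langevin update: gradient step plus Gaussian noise of variance 2*eta*T."""
    rng = np.random.default_rng(seed)
    theta, samples = theta0, []
    for _ in range(steps):
        theta = theta - eta * grad(theta) \
                + np.sqrt(2.0 * eta * temperature) * rng.normal()
        samples.append(theta)
    return np.array(samples[steps // 2:])   # discard burn-in

# Quadratic loss L = 0.5*(theta - 2)^2 at temperature T: the stationary
# distribution is exp(-L/T), a Gaussian with mean 2 and variance T.
samples = sgld(grad=lambda t: t - 2.0, theta0=0.0, temperature=0.1)
```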
- Compute/data budgeting with neural scaling laws
- Use case: Forecast loss vs. data/model size trade-offs to plan training budgets and timelines; reduce over- or under-provisioning of compute.
- Sectors: Software/AI operations, platform engineering; relevant to sustainability policy discussions.
- Tools/workflows: Fit scaling-law models from pilot runs; use theory to constrain exponents and extrapolate to target performance.
- Assumptions/dependencies: Scaling exponents can be task/model dependent; extrapolations safer within observed regimes.
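Fitting a scaling law from pilot runs reduces to linear regression in log-log coordinates; a sketch on synthetic losses with a hypothetical exponent of 0.35:

```python
import numpy as np

# Synthetic pilot runs: losses following a hypothetical power law a * P^(-b).
P = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
noise = np.random.default_rng(0).normal(0.0, 0.01, P.size)
loss = 5.0 * P**-0.35 * np.exp(noise)

# A power law is a straight line in log-log space: log L = log a - b * log P.
slope, intercept = np.polyfit(np.log(P), np.log(loss), 1)
b_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate to a target dataset size (safest near the observed regime).
predicted_loss = a_hat * 3e5**-b_hat
```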
- Stability and memory tuning for RNNs (unified with DNN analysis)
- Use case: Set spectral radius/weight variance to keep RNNs in non-chaotic, information-propagating regimes; improve time-series model stability.
- Sectors: IoT, finance (time-series), speech processing, industrial control.
- Tools/workflows: “Criticality checks” during architecture design; monitor correlation propagation across time steps.
- Assumptions/dependencies: Results derived for vanilla RNNs with i.i.d. weights; long-range dependencies and complex cells (e.g., LSTM/GRU) require adaptation.
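A minimal criticality check for a vanilla tanh RNN, assuming the setup above: rescale the recurrent matrix to a target spectral radius g and track the distance between two nearby trajectories; below the transition perturbations decay, well above it the dynamics are chaotic (g, width, and horizon are illustrative):

```python
import numpy as np

def trajectory_divergence(g, width=200, steps=100, seed=0):
    """Distance between two nearby trajectories of a random tanh RNN."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(width, width))
    W *= g / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius g
    h1 = 0.1 * rng.normal(size=width)
    h2 = h1 + 1e-6 * rng.normal(size=width)         # tiny perturbation
    for _ in range(steps):
        h1, h2 = np.tanh(W @ h1), np.tanh(W @ h2)
    return np.linalg.norm(h1 - h2)

d_stable = trajectory_divergence(g=0.5)   # contracting regime: perturbation dies
d_chaotic = trajectory_divergence(g=2.0)  # chaotic regime: perturbation grows
```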
- Training-noise and optimizer setting with Langevin/SGLD insights
- Use case: Use Fokker–Planck/Bayesian equivalence to choose noise levels (“temperature”), batch-size, and learning-rate schedules that better approximate posterior sampling.
- Sectors: Software/AI; privacy-aware ML where SGLD is used.
- Tools/workflows: Temperature-tuned SGLD; scheduler templates targeting stationary distributions.
- Assumptions/dependencies: Stationary/posterior equivalence holds under specific conditions (ergodicity, noise calibration); plain SGD is not SGLD unless noise is explicitly injected.
- Regime identification: lazy vs. feature-learning
- Use case: Use P/N ratio (samples per width) and theory to determine whether to rely on linear readouts (lazy regime) or to enable inner-layer adaptation (feature-learning); adjust width, readout strength, or training plan accordingly.
- Sectors: NLP/vision (linear probing vs. full fine-tuning), embedded ML.
- Tools/workflows: Pre-training checklist estimating α = P/N; adjust model size, regularization, and data strategy based on theory.
- Assumptions/dependencies: Joint limit reasoning (P, N → ∞ with α fixed) approximates finite cases; priors/activations matter.
- Education and upskilling
- Use case: Teach engineers and researchers a principled framework for analyzing networks; shorten iteration cycles by theoretical pre-analysis.
- Sectors: Academia, R&D teams, corporate training.
- Tools/workflows: Course modules on moments/cumulants, large deviations, NNGP, feature-learning theory.
- Assumptions/dependencies: Requires mathematical background; benefits accrue as teams apply tools to real workloads.
Long-Term Applications
These opportunities require further research, scaling, or engineering before widespread deployment.
- Theory-driven architecture compilers and pre-training simulators
- Use case: Tools that predict performance and suggest architecture/initialization choices (depth, width, activations, priors) to maximize feature learning and minimize compute before any training.
- Sectors: AutoML platforms, foundation-model engineering.
- Potential products: “Thermo-ML” simulators integrating NNGP, feature-learning corrections, and scaling laws into a design IDE.
- Assumptions/dependencies: Needs robust finite-width corrections and validated non-linear, non-asymptotic models across tasks.
- Certified UQ and safety frameworks for regulated AI
- Use case: Auditing and certification pipelines based on posterior predictive distributions and theoretically informed calibration, for clinical or financial approvals.
- Sectors: Healthcare, finance, critical infrastructure.
- Potential products: Compliance toolkits that report theory-backed UQ metrics and phase-transition risk flags.
- Assumptions/dependencies: Must extend beyond i.i.d. settings to handle distribution shift and model misspecification; regulatory alignment required.
- Phase-transition–guided curricula and active learning
- Use case: Dynamically schedule data and tasks to move models across learning-phase boundaries efficiently (e.g., from non-specialized to specialized regimes); reduce sample complexity.
- Sectors: Education tech, enterprise ML, scientific ML.
- Potential workflows: Curriculum planners informed by emergent learning phases for sequence and supervised tasks.
- Assumptions/dependencies: Requires non-equilibrium (training-dynamics) extensions beyond stationary/Langevin analyses.
- Hybrid GP–NN systems with adaptive kernels
- Use case: Data-efficient learners that adapt kernels during training, combining GP interpretability with NN capacity; enable on-device personalization and rapid adaptation.
- Sectors: Edge AI, robotics, industrial inspection.
- Potential products: Lightweight, fast-adapting models with explicit kernel adaptation modules.
- Assumptions/dependencies: Scalable kernel-adaptation solvers; memory-efficient approximations; validated on non-synthetic tasks.
- Transformer/sequence-model design via phase transitions
- Use case: Engineer positional encoding, data mixtures, and training regimes to control transitions between content vs. positional encoding dominance.
- Sectors: NLP, speech, genomics, code models.
- Potential workflows: Pre-training data composition and encoding selection guided by phase maps.
- Assumptions/dependencies: Current results are theoretical; need empirical validation and extensions to modern transformer variants.
- Energy-aware AI policy and compute planning
- Use case: Inform organizational and governmental policies on sustainable AI using theory-informed scaling laws to forecast energy and compute needs for desired accuracy.
- Sectors: Public policy, large AI labs, cloud providers.
- Potential tools: Policy dashboards linking performance targets to compute/energy budgets.
- Assumptions/dependencies: Scaling relations must be validated across modalities; policy impact requires standardized reporting.
- Hardware-/noise-aware model and chip co-design
- Use case: Align hardware noise/precision (effective temperature) and model criticality for stable, efficient training and inference; exploit Langevin analogs in neuromorphic/analog hardware.
- Sectors: Semiconductor, neuromorphic computing.
- Potential products: Co-designed accelerators with tunable noise matching theoretical optima.
- Assumptions/dependencies: Close collaboration between ML theory and hardware; robustness to real-world non-idealities.
- Robustness guarantees via large-deviation analysis
- Use case: New evaluation metrics and guarantees for generalization/robustness grounded in large-deviation theory applied to learning.
- Sectors: Safety-critical ML, scientific computing.
- Potential workflows: Risk assessment reports with large-deviation–based bounds.
- Assumptions/dependencies: Extending average-case results to practically meaningful guarantees; handling distributional shift.
Cross-Cutting Assumptions and Dependencies
- i.i.d. train/test assumption: The notes explicitly assume no distribution shift or transfer learning; applications to OOD settings require extensions.
- Asymptotic regimes: Many results rely on large-width (NNGP) and joint limit analyses; finite-width behaviors may deviate and need corrections.
- Priors and architectures: Results often assume Gaussian i.i.d. weight priors and vanilla DNN/RNN architectures; specialized architectures (CNNs, GNNs, transformers) need adapted analyses (some groundwork exists).
- Optimization dynamics: Bayesian–Langevin correspondence depends on explicit noise injection and stationarity; standard SGD without noise may not sample the posterior.
- Activation functions and parameterization: Criticality and feature-learning behavior depend on activation choice and parameter scaling (e.g., μP parameterization).
- Data regime: Feature-learning benefits hinge on the P/N ratio; in extremely small-data or extremely overparameterized regimes, lazy limits may dominate.
By integrating these theoretical tools into design, training, and governance workflows, organizations can reduce trial-and-error, improve stability and data efficiency, and make more informed trade-offs between accuracy, compute, and risk.
Glossary
- Action (statistical field theory): In field theory, the function in the exponent of a probability weight or path integral; for quadratics, often called a Gaussian action. "the exponent on the right hand side is often referred to as the 'action'; ... one may also call it a quadratic or Gaussian action."
- Bayes-optimal inference: Inference setting where the model (student) matches the data-generating process (teacher) and uses its prior, yielding optimal performance under Bayes’ rule. "exploits the Nishimori conditions that hold for Bayes-optimal inference, where student and teacher have the same architecture"
- Cramer-Rao bounds: Fundamental lower bounds on the variance of unbiased estimators, used to assess learnability and estimation limits. "Cramer-Rao learning bounds \citep{Cramer1946,Rao1947,Seroussi22}"
- Cumulant: Statistical quantities capturing irreducible dependencies at each order, obtained from derivatives of the cumulant-generating function. "motivates the definition of cumulants in the following."
- Cumulant generating function: The log of the moment-generating function whose derivatives yield cumulants. "The function defined by \eqref{eq:def_W} is called the cumulant generating function."
- Deep kernel machines: Models that stack kernel mappings across layers, yielding hierarchical kernel transformations. "For deep kernel machines, \citep{Yang23_39380} find a trade-off between network prior and data term;"
- Dirac distribution: Generalized function δ used to represent point masses; for vectors, factorizes across coordinates. "the Dirac δ-distribution acting on a vector"
- Disordered systems: Physical systems with randomness in their parameters, often analyzed using statistical mechanics. "statistical physics, disordered systems, and large deviation theory"
- Edgeworth expansion: Asymptotic series improving Gaussian approximations by incorporating higher-order cumulants. "a perturbative approach based on the Edgeworth expansion that uses the strength of the non-Gaussian cumulants as an expansion parameter."
- Feature learning: Adaptation of internal representations based on targets and data, beyond fixed kernels like NNGP. "Understanding feature learning to its full extent is a field of active research"
- Fokker-Planck equation: Partial differential equation describing the time evolution of probability densities under stochastic dynamics. "derives the Fokker-Planck equation as a technique to study the time-evolution of the probability distribution of network parameters"
- Free energy: Quantity combining energy and entropy that systems minimize; central to phase transitions in training. "such phase transitions arise from the competition between the energy and the entropy, from the principle of minimal free energy."
- Gaussian equivalence principle: Result enabling replacement of certain distributions by Gaussian ones under specific conditions. "which allows them to use the Gaussian equivalence principle \citep{Goldt20_14709} to obtain closed-form solutions."
- Gaussian process regression: Nonparametric Bayesian regression where functions are modeled as Gaussian processes. "leads to a particularly simple theory of Gaussian process regression."
- Inductive bias: Implicit preferences of a model toward certain solutions, affecting generalization behavior. "such as the inductive bias towards implementing smooth functions"
- Ising spin: Binary variable taking values ±1, a canonical model in statistical physics. "An Ising spin ... has the moment-generating function ..."
- Kernel adaptation: Changes in the effective kernel induced by training and data, beyond fixed NNGP kernels. "The kernel adaptation theory of feature learning has been pioneered in \citet{seroussi2023separation_main}"
- Large deviation theory: Framework quantifying probabilities of rare events and asymptotics of distributions at large scale. "large deviation theory \citep{Touchette09}"
- Langevin dynamics: Stochastic dynamics combining gradient forces with noise, leading to equilibrium distributions. "gradient-descent training with stochastic Langevin dynamics."
- Langevin training: Training procedure where parameters follow Langevin dynamics, yielding a stationary posterior. "a stochastic version of gradient descent known as Langevin training."
- Lyapunov exponents: Measures of sensitivity to initial conditions in dynamical systems, indicating chaos and stability. "such as Lyapunov exponents or robustness measures."
- Meijer-G functions: Special functions that generalize many classical functions, used in exact solutions of deep linear networks. "A rigorous non-asymptotic solution for deep linear networks in terms of Meijer-G functions \citep{Hanin23}"
- Moment generating function: Expectation of the exponential of a random variable; its derivatives produce moments. "The function ... is called the characteristic function or \textbf{moment generating function}"
- Neural Network Gaussian Process (NNGP): GP limit of wide neural networks with fixed features, capturing lazy learning. "The most prominent of which are the Neural Network Gaussian Process (NNGP \citep{Neal96,Williams98,Cho09})"
- Neural scaling laws: Empirical power laws describing how loss declines with data or model size. "the emergence of neural scaling laws \citep{Bahri_2024}; the latter are power laws that describe the decline of the loss"
- Neural Tangent Kernel (NTK): Kernel describing training dynamics under gradient flow in the infinite-width limit. "and the Neural Tangent Kernel (NTK \citep{Jacot18_8580})"
- Nishimori conditions: Identities that hold under Bayes-optimal inference in teacher–student models, enabling simplifications. "exploits the Nishimori conditions that hold for Bayes-optimal inference"
- Partition function: Normalizing integral over states in statistical mechanics; analogous to generating functions in Bayesian learning. "maps the problem of learning to the study of a partition function"
- Posterior distribution: Updated parameter (or output) distribution after observing data under a prior and likelihood. "The resulting parameter distribution is known as the \textbf{posterior distribution}."
- Saddle-point integration: Asymptotic method approximating integrals via stationary points of the exponent. "derive self-consistency equations either by saddle-point integration or by variational methods"
- Statistical field theory: Physics framework using fields and actions to study collective phenomena; applied here to learning. "Our focus on statistical field theory should not be confused with the long-standing and rich field of statistical learning theory"
- Teacher-student setting: Experimental setup where a student model learns from data generated by a teacher model. "For a teacher-student setting, \citep{ZavatoneVeth22_064118} show that in deep linear networks"
- Thermodynamic limit: Limit of infinitely large systems (e.g., many neurons) revealing simplified macroscopic behavior. "The thermodynamic limit of large numbers of neurons but with a limited number of training data points"
- Transition to chaos: Critical change in dynamical behavior from ordered to chaotic, relevant to signal propagation. "initializing networks at the critical point, the transition to chaos \citep{molgedey92_3717}"
- Vapnik–Chervonenkis dimension: Measure of capacity/complexity of a hypothesis class in statistical learning. "Vapnik-Chervonenkis dimension \citep{VapnikChervonenkis1968}"
- Variational methods: Optimization-based approximations to complex posteriors or integrals via tractable families. "either by saddle-point integration or by variational methods"
- Wick's theorem: Rule for computing moments of Gaussian variables by pairwise contractions. "as well as the Gaussian distribution as an important example and Wick's theorem."
- μP parametrization: Scaling scheme for parameters to align finite-width behavior with infinite-width limits. "to the parametrization \citep{Yang2021_icml} of the readout weights."
- Readout weights: Parameters in the final layer mapping internal representations to outputs. "studies the limit of very weak readout weights"