Lecture notes: From Gaussian processes to feature learning
Abstract: These lecture notes develop the theory of learning in deep and recurrent neuronal networks from the point of view of Bayesian inference. The aim is to enable the reader to understand typical computations found in the literature in this field. Initial chapters develop the theoretical tools, such as probabilities, moment and cumulant-generating functions, and some notions of large deviation theory, as far as they are needed to understand collective network behavior with large numbers of parameters. The main part of the notes derives the theory of Bayesian inference for deep and recurrent networks, starting with the neural network Gaussian process (lazy-learning) limit, which is subsequently extended to study feature learning from the point of view of adaptive kernels. The notes also expose the link between the adaptive kernel approach and approaches of kernel rescaling.
Explain it Like I'm 14
Overview
This paper is a set of teaching notes about how and why neural networks learn. The authors use ideas from physics to explain when networks behave in simple, predictable ways and when they truly learn useful “features” from data. A big focus is on two things:
- Gaussian processes: a neat mathematical way to make smooth predictions.
- Feature learning: how networks change their internal “view” of data to get better at tasks.
They show how deep networks (like the ones used for images) and recurrent networks (used for sequences) can be studied in one unified, physics-inspired framework.
Key Objectives
The notes aim to answer clear questions:
- How do neural networks learn from a limited amount of data and then generalize to new examples?
- When do networks do “lazy learning” (barely changing their weights) versus real feature learning (changing internal representations to match the task)?
- How can we describe learning using Bayesian inference (updating beliefs based on data)?
- What happens to learning when networks get very wide (many neurons) and when the size of the training set grows?
- Are there “phase transitions” in learning, like the sudden change when water freezes?
Methods and Approach (explained simply)
The authors build their story step by step, using everyday analogies:
- Bayesian inference: Think of the network’s weights as your “beliefs.” Before seeing data, you have a prior belief about them. After seeing the training data, you update to a posterior belief (your new, informed opinion).
- Posterior over outputs: Instead of tracking every weight, they mostly study what the network outputs look like after training. This is simpler and still very informative.
- Statistical physics tools:
- Partition function and free energy: Imagine scoring all possible settings of the network by how well they fit the data and how complicated they are. Free energy balances “fit” (energy) against “flexibility” or “number of options” (entropy).
- Phase transitions: Like water turning to ice, networks can suddenly switch behavior when you change something (for example, adding enough training examples or changing architecture).
- Law of large numbers and large deviations: With many neurons, averages become predictable (like flipping many coins). The notes use this to simplify complicated network behavior.
- Fokker–Planck equation and Langevin training: Picture a marble rolling in a landscape with a little random shaking. Over time, it settles into regions it prefers. That’s like weights moving under noisy gradient descent; the Fokker–Planck equation describes how their probability distribution changes and eventually settles.
- Gaussian processes (GPs): A GP is a way to predict outcomes that naturally prefers smooth functions. In very wide networks with few training points, the network behaves like a GP (specifically the Neural Network Gaussian Process, or NNGP). This is “lazy learning”—the network barely changes its weights from their starting values.
- Kernels: A kernel is a measure of similarity between inputs. In GP models, kernels control what kinds of patterns you can learn. The notes discuss two ways kernels change during feature learning:
- Scaling view: The kernel mostly keeps its shape but gets scaled (amplified or reduced).
- Adaptive kernel view: The kernel actively reshapes, adapting to the data to represent features better.
- They explain how these two views are connected.
- Deep and recurrent networks: The notes show how both types can be treated similarly in this framework, and they compare their behaviors.
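As a concrete illustration of the GP regression picture above, here is a minimal NumPy sketch; the squared-exponential kernel and its hyperparameters are illustrative stand-ins for a network-derived NNGP kernel, not the notes' construction:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0):
    """Squared-exponential kernel; a stand-in for a network-derived NNGP kernel."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """GP regression: posterior mean and variance of outputs at the test points."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)
    K_ss = rbf_kernel(X_test, X_test)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

# A smooth target: the GP posterior interpolates smoothly between training
# points, illustrating the bias towards smooth functions in the lazy limit.
X = np.linspace(-3, 3, 10)[:, None]
y = np.sin(X).ravel()
mean, var = gp_posterior(X, y, np.array([[0.5]]))
```

The posterior mean interpolates the training targets smoothly, and the posterior variance grows away from the data, which is what makes the GP view useful for uncertainty estimates.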
Along the way, the notes teach important math tools like moments and cumulants (ways to summarize distributions), the Gaussian distribution, and Wick’s theorem (a method to compute averages in Gaussian models).
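Wick's theorem can be checked numerically: for a zero-mean Gaussian the fourth moment is the sum over the three pairwise contractions, 3σ⁴, so the fourth cumulant vanishes. A quick Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
x = rng.normal(0.0, sigma, size=2_000_000)

# Wick's theorem for a zero-mean Gaussian: <x^4> equals the sum over the
# three pairwise contractions, i.e. 3 * <x^2>^2 = 3 * sigma^4.
m2 = np.mean(x**2)
m4 = np.mean(x**4)

# The fourth cumulant kappa_4 = <x^4> - 3<x^2>^2 vanishes for a Gaussian:
# beyond the second cumulant there is no extra information.
kappa4 = m4 - 3.0 * m2**2
```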
Main Findings and Why They Matter
- In the “infinite width, few samples” limit, deep and recurrent networks behave like Gaussian processes (NNGP). This predicts smooth outputs and explains “lazy learning.” It’s easy to analyze and gives basic insights into generalization and biases.
- True feature learning appears when both the network width and the number of training samples grow together (keeping their ratio fixed). In this regime:
- The kernel changes beyond simple scaling: it adapts to the data.
- Learning can undergo phase transitions with respect to data size or architectural choices, marking sharp changes like the “onset” of specialization.
- The notes unify two popular ways of describing feature learning—scaling and adaptive kernel—and show how they relate.
- They link Bayesian posterior analysis to the end state of noisy gradient descent training (Langevin dynamics), using the Fokker–Planck equation. This ties a theoretical, “belief-updating” view directly to how training actually behaves over time.
- The framework recovers known results like the NNGP limit and explains practical ideas such as:
- Smooth function bias: networks tend to prefer smooth solutions.
- Neural scaling laws: losses often decrease via power laws as data or model size increases.
- Critical initialization: placing networks near a “transition to chaos” can help signals and gradients flow well.
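The contrast between the two kernel views can be made concrete with a toy example: rescaling a kernel leaves its eigenvectors, the directions a GP can learn, untouched, while an adapted kernel rotates them. The base kernel, the rank-one "task direction" v, and all coefficients below are illustrative assumptions, not the notes' construction:

```python
import numpy as np

x = np.linspace(0.0, 3.0, 20)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)    # toy base kernel (RBF)

# Scaling view: c*K keeps the eigenvectors; only the eigenvalues rescale.
_, V = np.linalg.eigh(K)
_, V_scaled = np.linalg.eigh(3.0 * K)

# Adaptive view (toy): a task-aligned rank-one term v v^T reshapes the kernel.
v = np.cos(4.0 * x)                                # hypothetical task direction
_, V_adapt = np.linalg.eigh(K + 2.0 * np.outer(v, v))

top, top_scaled, top_adapt = V[:, -1], V_scaled[:, -1], V_adapt[:, -1]
align_scaled = abs(top @ top_scaled)   # ~1: same directions, rescaled weights
align_adapt = abs(top @ top_adapt)     # well below 1: genuinely new directions
```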
Implications and Potential Impact
Understanding neural networks with these physics-inspired tools can:
- Help design better architectures and training strategies before spending huge resources on training, saving time, money, and energy.
- Explain when and how feature learning reduces the number of examples needed to reach good accuracy.
- Predict sharp changes (phase transitions) in learning behavior, guiding choices like data size and model width.
- Provide a common language to compare deep and recurrent networks, and extend ideas to modern architectures like transformers and CNNs.
- Strengthen the bridge between machine learning and physics, giving researchers powerful tools to analyze complex models.
In short, these notes offer a clear path to understanding the “why” behind neural network performance, especially the jump from simple, smooth predictions to rich, task-specific feature learning.
Knowledge Gaps
Limitations and open questions
Below is a focused list of what remains missing, uncertain, or unexplored in the paper, formulated to be actionable for future research.
- Extend the theory from Langevin-trained stationary posteriors to realistic training dynamics (mini-batch SGD, momentum, Adam), including non-equilibrium transients and state-dependent/colored noise; derive a generalized Fokker–Planck description and quantify deviations from the Bayesian stationary posterior.
- Analyze the deterministic gradient-flow/NTK regime within the same field-theoretic framework; characterize feature-learning corrections under gradient flow and establish conditions under which NTK and Bayesian posterior predictions agree or diverge.
- Go beyond the assumption that train and test data are identically distributed; develop a theory for distributional shift, transfer learning, and domain adaptation, quantifying how kernel adaptation and generalization degrade or improve under mismatched distributions.
- Incorporate online, active, few-shot, curriculum, and continual learning into the theoretical framework; model time-varying data/targets and derive phase behavior and generalization dynamics over training time.
- Move beyond vanilla DNNs/RNNs to architecture-specific analyses:
- CNNs: account for locality and weight sharing in GP limits and feature-learning self-consistency; quantify changes to phase transitions and scaling.
- ResNets: include skip connections and depth-wise normalization effects on kernels and feature learning.
- Transformers: model attention (query-key-value statistics), positional encodings, and sequence length scaling; derive GP and kernel-adaptation limits with attention-induced constraints.
- GNNs: incorporate graph structure and message passing into GP/feature-learning theory.
- Link output posterior analysis to internal representation geometry; define measurable observables (e.g., layer-wise manifold curvature, Fisher information, representation alignment) and derive how kernel adaptation reshapes hidden representations.
- Integrate pruning, sparsification, and explicit sparsity penalties (L0/L1) into the Bayesian field-theoretic framework; characterize their impact on kernel adaptation, phase transitions, sample complexity, and generalization.
- Generalize beyond quadratic loss and Gaussian likelihoods to classification settings (cross-entropy, margin-based losses) and non-Gaussian observation models; derive corresponding posteriors and feature-learning corrections, and assess changes in phase behavior.
- Provide systematic finite-width/finite-depth/finite-sample analyses with non-asymptotic error bounds; quantify the accuracy of large-deviation/saddle-point approximations and identify regimes where asymptotic predictions fail.
- Compare parameterizations (standard, NTK, μP) in terms of feature-learning strength and kernel deformation; derive scaling relations for readout and hidden weights that govern the onset and magnitude of feature learning.
- For recurrent networks, extend beyond vanilla RNNs to gated models (LSTM/GRU), long-sequence limits, and non-ergodic inputs; analyze stability, vanishing/exploding gradients, and feature learning under realistic temporal statistics.
- Characterize phase transitions more fully: locate critical points, determine order and critical exponents, study universality classes across architectures/activations/losses, and perform finite-size scaling analyses.
- Move from data-agnostic/Gaussian i.i.d. priors to structured or data-dependent priors (orthogonal, low-rank, heavy-tailed, evidence-maximizing); quantify how prior choice affects GP limits, kernel adaptation, and equivalence to shallow models beyond linear networks.
- Establish existence, uniqueness, and stability of kernel-adaptation fixed points for common nonlinear activations; develop robust numerical schemes with convergence guarantees for solving the self-consistency equations.
- Systematically study how activation functions (ReLU, GELU, tanh, etc.) modulate feature-learning strength, kernel deformation, and phase behavior; derive analytic expressions where possible.
- Empirically validate theory-derived predictions (e.g., generalization-error scaling, phase transitions, feature-learning gains over NNGP) across multiple datasets, tasks, and architectures; design controlled experiments to isolate kernel-adaptation effects.
- Extend to multi-task and continual learning: model shared kernels across tasks, track kernel evolution with task sequences, and analyze catastrophic forgetting and potential phase transitions in multi-task regimes.
- Analyze uncertainty quantification and calibration under feature learning; derive conditions under which output posteriors are well-calibrated and how mismatched priors/likelihoods affect calibration and epistemic uncertainty.
- Incorporate regularization and normalization (weight decay, dropout, batch norm, layer norm) and data augmentation/invariance constraints into the Bayesian/theoretical framework; quantify their impact on kernel adaptation and generalization.
- Address computational scalability of kernel-adaptation: develop efficient approximations (e.g., low-rank/inducing-point methods, stochastic estimation, random features) suitable for large datasets and modern architectures.
- For classification, relate output posteriors to decision boundaries and margins; derive sample-complexity improvements from feature learning in terms of margin distributions and robust generalization guarantees.
- Study out-of-distribution behavior using posterior variance and other uncertainty measures; determine how feature learning affects OOD detection and robustness.
- Unify and generalize analyses of position-information phase transitions in sequence models; connect to different positional encoding schemes and quantify trade-offs between content and positional information encoding.
- Provide quantitative links to neural scaling laws: derive exponents/pre-factors for loss decline with data and model size within the presented framework under realistic data models, and explain deviations observed in practice.
Practical Applications
Overview
These lecture notes develop a Bayesian, statistical-physics-based toolkit for analyzing deep and recurrent neural networks (DNNs, RNNs). Core contributions include:
- A unified derivation of the Neural Network Gaussian Process (NNGP) limit for wide networks and its predictive use.
- Non-perturbative theories of feature learning in the joint limit of network width and dataset size (scaling and kernel-adaptation views) and their unification.
- Phase-transition perspectives on learning as a function of data/architecture, and on sequence processing (e.g., content vs. positional encoding).
- A connection between stochastic gradient-based training (Langevin/SGLD) and Bayesian posteriors via the Fokker–Planck formalism.
- A practical calculational toolbox (moments, cumulants, large deviations) for average-case analysis of neural learning.
Below are practical applications derived from these findings and methods.
Immediate Applications
The following applications can be deployed with current tools and practices, often by integrating Gaussian-process (GP) approximations, initialization theory, and uncertainty estimation into existing ML workflows.
- Architecture initialization at the “edge of chaos”
- Use case: Set weight/bias variances and activation functions to ensure signal and gradient propagation (critical initialization) for deep nets; stabilize training and improve trainability.
- Sectors: Software/AI, robotics, healthcare, finance.
- Tools/workflows: Add an “initialization check” step (compute variance maps or correlation maps across layers); adopt existing formulas to tune variance to the critical point.
- Assumptions/dependencies: Analysis assumes large width; independence/identical distribution (i.i.d.) of weights at init; results strongest for vanilla DNNs/RNNs.
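A minimal version of such an initialization check propagates the activation variance through a random zero-bias tanh network, for which the critical weight standard deviation is σ_w = 1; depth, width, and the probe values of σ_w below are illustrative:

```python
import numpy as np

def final_variance(sigma_w, depth=20, width=1000, seed=0):
    """Propagate one input through a random tanh MLP; return last-layer variance."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        h = np.tanh(W @ h)
    return np.var(h)

# Below the critical point the signal dies out layer by layer;
# above it the variance settles at a finite fixed point.
v_ordered = final_variance(sigma_w=0.5)   # ordered phase: variance -> 0
v_chaotic = final_variance(sigma_w=1.5)   # chaotic side: finite variance
```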
- Trainless model diagnostics via NNGP regression
- Use case: Predict generalization behavior and uncertainty using NNGP/GP surrogates before costly training; compare architectures or activation choices on a dataset to guide selection.
- Sectors: Software/AI platforms, AutoML, education/research.
- Tools/workflows: Integrate GP-based kernels computed from architectures (e.g., Neural Tangents) into model-selection pipelines; run GP regression as a baseline performance estimate.
- Assumptions/dependencies: Infinite-width limit; i.i.d. train/test data; priors match analysis (e.g., Gaussian i.i.d. weights); limited feature-learning effects captured.
- Uncertainty quantification (UQ) via Bayesian posteriors over outputs
- Use case: Attach calibrated predictive variances to neural predictions; set abstention thresholds and risk-aware decision rules.
- Sectors: Healthcare diagnostics, autonomous systems, finance risk management.
- Tools/workflows: Use GP posterior (NNGP) in small-data or early-stage settings; approximate Bayesian posteriors via Langevin/SGLD training; deploy UQ in serving stacks.
- Assumptions/dependencies: Posterior calibration quality depends on prior choice and i.i.d. assumption; SGLD noise/step-size must approximate desired temperature.
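The Langevin/SGLD update behind this workflow is a gradient step plus injected Gaussian noise, θ ← θ − η∇L(θ) + √(2ηT) ξ. On a quadratic loss its stationary distribution is known exactly, a Gaussian of variance T for unit curvature, which gives an end-to-end sanity check (step size, temperature, and loss are illustrative choices):

```python
import numpy as np

def sgld(grad, theta0, eta=1e-2, temperature=0.1, steps=50_000, seed=0):
    """Langevin update: gradient step plus Gaussian noise of variance 2*eta*T."""
    rng = np.random.default_rng(seed)
    theta, samples = theta0, []
    for _ in range(steps):
        theta = theta - eta * grad(theta) \
                + np.sqrt(2.0 * eta * temperature) * rng.normal()
        samples.append(theta)
    return np.array(samples[steps // 2:])   # discard burn-in

# Quadratic loss L = 0.5*(theta - 2)^2 at temperature T: the stationary
# distribution is exp(-L/T), a Gaussian with mean 2 and variance T.
samples = sgld(grad=lambda t: t - 2.0, theta0=0.0, temperature=0.1)
```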
- Compute/data budgeting with neural scaling laws
- Use case: Forecast loss vs. data/model size trade-offs to plan training budgets and timelines; reduce over- or under-provisioning of compute.
- Sectors: Software/AI operations, platform engineering; relevant to sustainability policy discussions.
- Tools/workflows: Fit scaling-law models from pilot runs; use theory to constrain exponents and extrapolate to target performance.
- Assumptions/dependencies: Scaling exponents can be task/model dependent; extrapolations safer within observed regimes.
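Fitting a scaling law from pilot runs reduces to linear regression in log-log coordinates; a sketch on synthetic losses with a hypothetical exponent of 0.35:

```python
import numpy as np

# Synthetic pilot runs: losses following a hypothetical power law a * P^(-b).
P = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
noise = np.random.default_rng(0).normal(0.0, 0.01, P.size)
loss = 5.0 * P**-0.35 * np.exp(noise)

# A power law is a straight line in log-log space: log L = log a - b * log P.
slope, intercept = np.polyfit(np.log(P), np.log(loss), 1)
b_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate to a target dataset size (safest near the observed regime).
predicted_loss = a_hat * 3e5**-b_hat
```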
- Stability and memory tuning for RNNs (unified with DNN analysis)
- Use case: Set spectral radius/weight variance to keep RNNs in non-chaotic, information-propagating regimes; improve time-series model stability.
- Sectors: IoT, finance (time-series), speech processing, industrial control.
- Tools/workflows: “Criticality checks” during architecture design; monitor correlation propagation across time steps.
- Assumptions/dependencies: Results derived for vanilla RNNs with i.i.d. weights; long-range dependencies and complex cells (e.g., LSTM/GRU) require adaptation.
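A minimal criticality check for a vanilla tanh RNN, assuming the setup above: rescale the recurrent matrix to a target spectral radius g and track the distance between two nearby trajectories; below the transition perturbations decay, well above it the dynamics are chaotic (g, width, and horizon are illustrative):

```python
import numpy as np

def trajectory_divergence(g, width=200, steps=100, seed=0):
    """Distance between two nearby trajectories of a random tanh RNN."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(width, width))
    W *= g / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius g
    h1 = 0.1 * rng.normal(size=width)
    h2 = h1 + 1e-6 * rng.normal(size=width)         # tiny perturbation
    for _ in range(steps):
        h1, h2 = np.tanh(W @ h1), np.tanh(W @ h2)
    return np.linalg.norm(h1 - h2)

d_stable = trajectory_divergence(g=0.5)   # contracting regime: perturbation dies
d_chaotic = trajectory_divergence(g=2.0)  # chaotic regime: perturbation grows
```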
- Training-noise and optimizer setting with Langevin/SGLD insights
- Use case: Use Fokker–Planck/Bayesian equivalence to choose noise levels (“temperature”), batch-size, and learning-rate schedules that better approximate posterior sampling.
- Sectors: Software/AI; privacy-aware ML where SGLD is used.
- Tools/workflows: Temperature-tuned SGLD; scheduler templates targeting stationary distributions.
- Assumptions/dependencies: Stationary/posterior equivalence holds under specific conditions (ergodicity, noise calibration); plain SGD is not SGLD unless noise is explicitly injected.
- Regime identification: lazy vs. feature-learning
- Use case: Use P/N ratio (samples per width) and theory to determine whether to rely on linear readouts (lazy regime) or to enable inner-layer adaptation (feature-learning); adjust width, readout strength, or training plan accordingly.
- Sectors: NLP/vision (linear probing vs. full fine-tuning), embedded ML.
- Tools/workflows: Pre-training checklist estimating α = P/N; adjust model size, regularization, and data strategy based on theory.
- Assumptions/dependencies: Joint limit reasoning (P, N → ∞ with α fixed) approximates finite cases; priors/activations matter.
- Education and upskilling
- Use case: Teach engineers and researchers a principled framework for analyzing networks; shorten iteration cycles by theoretical pre-analysis.
- Sectors: Academia, R&D teams, corporate training.
- Tools/workflows: Course modules on moments/cumulants, large deviations, NNGP, feature-learning theory.
- Assumptions/dependencies: Requires mathematical background; benefits accrue as teams apply tools to real workloads.
Long-Term Applications
These opportunities require further research, scaling, or engineering before widespread deployment.
- Theory-driven architecture compilers and pre-training simulators
- Use case: Tools that predict performance and suggest architecture/initialization choices (depth, width, activations, priors) to maximize feature learning and minimize compute before any training.
- Sectors: AutoML platforms, foundation-model engineering.
- Potential products: “Thermo-ML” simulators integrating NNGP, feature-learning corrections, and scaling laws into a design IDE.
- Assumptions/dependencies: Needs robust finite-width corrections and validated non-linear, non-asymptotic models across tasks.
- Certified UQ and safety frameworks for regulated AI
- Use case: Auditing and certification pipelines based on posterior predictive distributions and theoretically informed calibration, for clinical or financial approvals.
- Sectors: Healthcare, finance, critical infrastructure.
- Potential products: Compliance toolkits that report theory-backed UQ metrics and phase-transition risk flags.
- Assumptions/dependencies: Must extend beyond i.i.d. settings to handle distribution shift and model misspecification; regulatory alignment required.
- Phase-transition–guided curricula and active learning
- Use case: Dynamically schedule data and tasks to move models across learning-phase boundaries efficiently (e.g., from non-specialized to specialized regimes); reduce sample complexity.
- Sectors: Education tech, enterprise ML, scientific ML.
- Potential workflows: Curriculum planners informed by emergent learning phases for sequence and supervised tasks.
- Assumptions/dependencies: Requires non-equilibrium (training-dynamics) extensions beyond stationary/Langevin analyses.
- Hybrid GP–NN systems with adaptive kernels
- Use case: Data-efficient learners that adapt kernels during training, combining GP interpretability with NN capacity; enable on-device personalization and rapid adaptation.
- Sectors: Edge AI, robotics, industrial inspection.
- Potential products: Lightweight, fast-adapting models with explicit kernel adaptation modules.
- Assumptions/dependencies: Scalable kernel-adaptation solvers; memory-efficient approximations; validated on non-synthetic tasks.
- Transformer/sequence-model design via phase transitions
- Use case: Engineer positional encoding, data mixtures, and training regimes to control transitions between content vs. positional encoding dominance.
- Sectors: NLP, speech, genomics, code models.
- Potential workflows: Pre-training data composition and encoding selection guided by phase maps.
- Assumptions/dependencies: Current results are theoretical; need empirical validation and extensions to modern transformer variants.
- Energy-aware AI policy and compute planning
- Use case: Inform organizational and governmental policies on sustainable AI using theory-informed scaling laws to forecast energy and compute needs for desired accuracy.
- Sectors: Public policy, large AI labs, cloud providers.
- Potential tools: Policy dashboards linking performance targets to compute/energy budgets.
- Assumptions/dependencies: Scaling relations must be validated across modalities; policy impact requires standardized reporting.
- Hardware-/noise-aware model and chip co-design
- Use case: Align hardware noise/precision (effective temperature) and model criticality for stable, efficient training and inference; exploit Langevin analogs in neuromorphic/analog hardware.
- Sectors: Semiconductor, neuromorphic computing.
- Potential products: Co-designed accelerators with tunable noise matching theoretical optima.
- Assumptions/dependencies: Close collaboration between ML theory and hardware; robustness to real-world non-idealities.
- Robustness guarantees via large-deviation analysis
- Use case: New evaluation metrics and guarantees for generalization/robustness grounded in large-deviation theory applied to learning.
- Sectors: Safety-critical ML, scientific computing.
- Potential workflows: Risk assessment reports with large-deviation–based bounds.
- Assumptions/dependencies: Extending average-case results to practically meaningful guarantees; handling distributional shift.
Cross-Cutting Assumptions and Dependencies
- i.i.d. train/test assumption: The notes explicitly assume no distribution shift or transfer learning; applications to OOD settings require extensions.
- Asymptotic regimes: Many results rely on large-width (NNGP) and joint limit analyses; finite-width behaviors may deviate and need corrections.
- Priors and architectures: Results often assume Gaussian i.i.d. weight priors and vanilla DNN/RNN architectures; specialized architectures (CNNs, GNNs, transformers) need adapted analyses (some groundwork exists).
- Optimization dynamics: Bayesian–Langevin correspondence depends on explicit noise injection and stationarity; standard SGD without noise may not sample the posterior.
- Activation functions and parameterization: Criticality and feature-learning behavior depend on activation choice and parameter scaling (e.g., μP parameterization).
- Data regime: Feature-learning benefits hinge on the P/N ratio; in extremely small-data or extremely overparameterized regimes, lazy limits may dominate.
By integrating these theoretical tools into design, training, and governance workflows, organizations can reduce trial-and-error, improve stability and data efficiency, and make more informed trade-offs between accuracy, compute, and risk.
Glossary
- Action (statistical field theory): In field theory, the function in the exponent of a probability weight or path integral; for quadratics, often called a Gaussian action. "the exponent on the right hand side is often referred to as the 'action'; ... one may also call it a quadratic or Gaussian action."
- Bayes-optimal inference: Inference setting where the model (student) matches the data-generating process (teacher) and uses its prior, yielding optimal performance under Bayes’ rule. "exploits the Nishimori conditions that hold for Bayes-optimal inference, where student and teacher have the same architecture"
- Cramer-Rao bounds: Fundamental lower bounds on the variance of unbiased estimators, used to assess learnability and estimation limits. "Cramer-Rao learning bounds \citep{Cramer1946,Rao1947,Seroussi22}"
- Cumulant: Statistical quantities capturing irreducible dependencies at each order, obtained from derivatives of the cumulant-generating function. "motivates the definition of cumulants in the following."
- Cumulant generating function: The log of the moment-generating function whose derivatives yield cumulants. "The function defined by \eqref{eq:def_W} is called the cumulant generating function."
- Deep kernel machines: Models that stack kernel mappings across layers, yielding hierarchical kernel transformations. "For deep kernel machines, \citep{Yang23_39380} find a trade-off between network prior and data term;"
- Dirac distribution: Generalized function δ used to represent point masses; for vectors, factorizes across coordinates. "the Dirac δ-distribution acting on a vector"
- Disordered systems: Physical systems with randomness in their parameters, often analyzed using statistical mechanics. "statistical physics, disordered systems, and large deviation theory"
- Edgeworth expansion: Asymptotic series improving Gaussian approximations by incorporating higher-order cumulants. "a perturbative approach based on the Edgeworth expansion that uses the strength of the non-Gaussian cumulants as an expansion parameter."
- Feature learning: Adaptation of internal representations based on targets and data, beyond fixed kernels like NNGP. "Understanding feature learning to its full extent is a field of active research"
- Fokker-Planck equation: Partial differential equation describing the time evolution of probability densities under stochastic dynamics. "derives the Fokker-Planck equation as a technique to study the time-evolution of the probability distribution of network parameters"
- Free energy: Quantity combining energy and entropy that systems minimize; central to phase transitions in training. "such phase transitions arise from the competition between the energy and the entropy, from the principle of minimal free energy."
- Gaussian equivalence principle: Result enabling replacement of certain distributions by Gaussian ones under specific conditions. "which allows them to use the Gaussian equivalence principle \citep{Goldt20_14709} to obtain closed-form solutions."
- Gaussian process regression: Nonparametric Bayesian regression where functions are modeled as Gaussian processes. "leads to a particularly simple theory of Gaussian process regression."
- Inductive bias: Implicit preferences of a model toward certain solutions, affecting generalization behavior. "such as the inductive bias towards implementing smooth functions"
- Ising spin: Binary variable taking values ±1, a canonical model in statistical physics. "An Ising spin ... has the moment-generating function ..."
- Kernel adaptation: Changes in the effective kernel induced by training and data, beyond fixed NNGP kernels. "The kernel adaptation theory of feature learning has been pioneered in \citet{seroussi2023separation_main}"
- Large deviation theory: Framework quantifying probabilities of rare events and asymptotics of distributions at large scale. "large deviation theory \citep{Touchette09}"
- Langevin dynamics: Stochastic dynamics combining gradient forces with noise, leading to equilibrium distributions. "gradient-descent training with stochastic Langevin dynamics."
- Langevin training: Training procedure where parameters follow Langevin dynamics, yielding a stationary posterior. "a stochastic version of gradient descent known as Langevin training."
- Lyapunov exponents: Measures of sensitivity to initial conditions in dynamical systems, indicating chaos and stability. "such as Lyapunov exponents or robustness measures."
- Meijer-G functions: Special functions that generalize many classical functions, used in exact solutions of deep linear networks. "A rigorous non-asymptotic solution for deep linear networks in terms of Meijer-G functions \citep{Hanin23}"
- Moment generating function: Expectation of the exponential of a random variable; its derivatives produce moments. "The function ... is called the characteristic function or \textbf{moment generating function}"
- Neural Network Gaussian Process (NNGP): GP limit of wide neural networks with fixed features, capturing lazy learning. "The most prominent of which are the Neural Network Gaussian Process (NNGP \citep{Neal96,Williams98,Cho09})"
- Neural scaling laws: Empirical power laws describing how loss declines with data or model size. "the emergence of neural scaling laws \citep{Bahri_2024}; the latter are power laws that describe the decline of the loss"
- Neural Tangent Kernel (NTK): Kernel describing training dynamics under gradient flow in the infinite-width limit. "and the Neural Tangent Kernel (NTK \citep{Jacot18_8580})"
- Nishimori conditions: Identities that hold under Bayes-optimal inference in teacher–student models, enabling simplifications. "exploits the Nishimori conditions that hold for Bayes-optimal inference"
- Partition function: Normalizing integral over states in statistical mechanics; analogous to generating functions in Bayesian learning. "maps the problem of learning to the study of a partition function"
- Posterior distribution: Updated parameter (or output) distribution after observing data under a prior and likelihood. "The resulting parameter distribution is known as the \textbf{posterior distribution}."
- Saddle-point integration: Asymptotic method approximating integrals via stationary points of the exponent. "derive self-consistency equations either by saddle-point integration or by variational methods"
- Statistical field theory: Physics framework using fields and actions to study collective phenomena; applied here to learning. "Our focus on statistical field theory should not be confused with the long-standing and rich field of statistical learning theory"
- Teacher-student setting: Experimental setup where a student model learns from data generated by a teacher model. "For a teacher-student setting, \citep{ZavatoneVeth22_064118} show that in deep linear networks"
- Thermodynamic limit: Limit of infinitely large systems (e.g., many neurons) revealing simplified macroscopic behavior. "The thermodynamic limit of large numbers of neurons but with a limited number of training data points"
- Transition to chaos: Critical change in dynamical behavior from ordered to chaotic, relevant to signal propagation. "initializing networks at the critical point, the transition to chaos \citep{molgedey92_3717}"
- Vapnik–Chervonenkis dimension: Measure of capacity/complexity of a hypothesis class in statistical learning. "Vapnik-Chervonenkis dimension \citep{VapnikChervonenkis1968}"
- Variational methods: Optimization-based approximations to complex posteriors or integrals via tractable families. "either by saddle-point integration or by variational methods"
- Wick's theorem: Rule for computing moments of Gaussian variables by pairwise contractions. "as well as the Gaussian distribution as an important example and Wick's theorem."
- μP parametrization: Scaling scheme for parameters to align finite-width behavior with infinite-width limits. "to the parametrization \citep{Yang2021_icml} of the readout weights."
- Readout weights: Parameters in the final layer mapping internal representations to outputs. "studies the limit of very weak readout weights"