Limit Theorems for Stochastic Gradient Descent
- This synthesis establishes that limit theorems for SGD, including LLNs, CLTs, and diffusion approximations, precisely characterize convergence behavior and underpin uncertainty quantification.
- It demonstrates that fluctuation distributions vary with noise properties, step-size schedules, and high-dimensional scaling, impacting both variance reduction and stability.
- The results offer actionable insights for optimizing step-size, employing Polyak–Ruppert averaging, and tailoring SGD for deterministic or stochastic regimes in large-scale models.
Stochastic Gradient Descent (SGD) is a foundational paradigm for optimization and estimation in high-dimensional and large-scale learning and statistical tasks. The characterization of its long-time and large-sample behavior is governed by a suite of limit theorems, including laws of large numbers (LLNs), central limit theorems (CLTs), functional invariance principles, and diffusion approximations. These results quantify both the rates and the distributions of fluctuations around minimizers, under varying assumptions on noise, step-size schedules, and high-dimensional limits.
1. Classical Central Limit Theorems for SGD
SGD recursions targeting a minimizer $\theta^* \in \arg\min_\theta F(\theta)$ with i.i.d. data can be written as $\theta_{n+1} = \theta_n - \gamma_{n+1}\,(\nabla F(\theta_n) + \xi_{n+1})$, with appropriately regular objective $F$ and martingale-difference noise $(\xi_n)$. Under strong convexity, Lipschitz continuity, and vanishing step-size schedules $\gamma_n \to 0$, $\sum_n \gamma_n = \infty$, classical limit theorems assert:
- Almost sure convergence to the population minimizer $\theta^*$ when $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$ (Clémençon et al., 2015).
- A central limit theorem for the speed-normalized errors: $\gamma_n^{-1/2}(\theta_n - \theta^*) \Rightarrow \mathcal{N}(0, \Sigma)$, where $H = \nabla^2 F(\theta^*)$ and $S$ is the asymptotic covariance of the gradient noise, and $\Sigma$ solves the Lyapunov equation $H\Sigma + \Sigma H^\top = S$ (Anastasiou et al., 2019).
- For averaging schemes (Polyak–Ruppert), asymptotic normality holds with explicit Wasserstein/Kolmogorov convergence rates (Anastasiou et al., 2019).
These classical CLTs extend to SGD under unequal probability sampling (Horvitz–Thompson SGD), where the limit distribution is again Gaussian but with a Lyapunov equation for the covariance incorporating the inclusion probabilities and a precise variance reduction principle (Clémençon et al., 2015).
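The classical CLT can be sanity-checked numerically. The sketch below is a minimal illustration, assuming a one-dimensional quadratic objective $F(x) = h x^2/2$ with additive Gaussian gradient noise (all constants are illustrative, not taken from the cited papers); it verifies that the speed-normalized errors have variance close to the Lyapunov solution $S/(2h)$, with $S$ the gradient-noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of the SGD CLT on F(x) = h x^2 / 2 with additive N(0, s2)
# gradient noise and step sizes gamma_k = k^(-a), a in (1/2, 1):
# gamma_n^{-1/2} (theta_n - theta*) should be approximately N(0, Sigma),
# where Sigma solves the 1-D Lyapunov equation 2 h Sigma = s2.
h, s2, a, n = 1.0, 1.0, 0.8, 20_000
theta = np.ones(400)                      # 400 independent SGD runs
for k in range(1, n + 1):
    gamma = k ** (-a)
    theta -= gamma * (h * theta + np.sqrt(s2) * rng.standard_normal(theta.shape))

z = theta / np.sqrt(n ** (-a))            # speed-normalized errors
print(z.mean(), z.var())                  # ~0.0 and ~s2/(2h) = 0.5
```

The initial condition is forgotten exponentially fast here, so the empirical variance of the normalized errors matches the stationary Lyapunov covariance rather than any transient bias.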
2. Functional and Pathwise Invariance Principles
Beyond pointwise CLTs, functional central limit theorems (FCLTs) describe the trajectory-level fluctuations of SGD on path spaces:
- For convex objectives with (possibly only local) smoothness at the minimizer, suitably rescaled trajectories converge weakly on path space to the unique Ornstein–Uhlenbeck-type diffusion $dU_t = -\nabla^2 F(\theta^*)\,U_t\,dt + S^{1/2}\,dB_t$, with $S$ the gradient-noise covariance, encoding time-inhomogeneous correlations (Flamand et al., 17 Feb 2026).
- The Ornstein–Uhlenbeck (OU) process governs the limit, with covariance given by integral representations involving the matrix exponentials $e^{-t\nabla^2 F(\theta^*)}$ and the noise structure.
- Analogous FCLTs exist for SGLD in stationary or mixing environments, even without Markovianity of the data stream, yielding functional convergence in Skorokhod space (Lovas et al., 2022).
- Explicit non-asymptotic error bounds between the discrete SGD/SGLD process and the limiting OU diffusion can be obtained via Stein’s method for exchangeable pairs, with explicit accuracy rates in the univariate case and functional CLTs for time-averaged iterates (Wang et al., 21 Jan 2025).
These pathwise results enable the construction of temporal confidence regions and a more refined understanding of the influence of noise correlations and step-size choice on the entire SGD trajectory.
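As a minimal illustration of the trajectory-level picture, the sketch below (a toy one-dimensional quadratic with constant step-size; all parameters are illustrative) compares the rescaled fluctuation process $\gamma^{-1/2}\theta_{\lfloor t/\gamma\rfloor}$, started at the minimizer, against the time-$t$ variance of the limiting OU process.

```python
import numpy as np

rng = np.random.default_rng(5)

# Rescaled constant-step SGD fluctuations U_t = gamma^{-1/2} theta_{t/gamma}
# on F(x) = h x^2 / 2, started at the minimizer, versus the limiting OU
# process dU = -h U dt + sqrt(s2) dB, whose variance at time t is
# s2/(2h) * (1 - exp(-2 h t)).
h, s2, gamma, t = 1.0, 1.0, 0.01, 1.0
theta = np.zeros(5_000)                   # 5000 independent paths
for _ in range(int(t / gamma)):
    theta -= gamma * (h * theta + np.sqrt(s2) * rng.standard_normal(theta.shape))

u = theta / np.sqrt(gamma)                # rescaled fluctuation at time t
ou_var = s2 / (2 * h) * (1 - np.exp(-2 * h * t))
print(u.var(), ou_var)                    # both ~0.43
```

The agreement at a single time point is the finite-dimensional-marginal content of the FCLT; the pathwise statements additionally control the joint law across times.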
3. High-Dimensional and Critical Scaling Regimes
In modern regimes where both data dimension and sample size diverge, high-dimensional limit theorems track finite collections of summary statistics under SGD:
- For single- and multi-layer neural networks and mixture estimation, if the step-size $\delta$ satisfies $\delta \ll 1/d$ in dimension $d$, the dynamics of key summary statistics follow deterministic ODEs (the "ballistic" limit), found to be universal under broad data distributions with matching moments and delocalized initialization (Arous et al., 2022, Gheissari et al., 15 Dec 2025).
- At the critical scaling $\delta \sim 1/d$, stochastic fluctuations (the diffusive regime) become significant. The rescaled process converges to a system of SDEs, typically linear around fixed points, often reducing to Ornstein–Uhlenbeck processes. This regime brings downstream effects:
- The phase diagram and the basin structure of SGD depend on both drift and covariance corrections.
- Fluctuations around critical points can induce noise-driven escapes or convergence to suboptimal manifolds (Arous et al., 2022, Rangriz, 4 Nov 2025).
- Universality holds for ODE fluctuations under suitable moment and initialization conditions, but may break for SDE fluctuations or poorly chosen initializations, as higher-order moment effects are amplified (Gheissari et al., 15 Dec 2025).
High-dimensional theorems reveal phenomena such as multimodal convergence timescales, probabilistic trapping in spurious optima, and the role of overparametrization in improving convergence probabilities (Arous et al., 2022).
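The ballistic picture can be illustrated in an especially tractable toy model (a hypothetical Gaussian mean-estimation setup, not taken from the cited papers): online SGD on the per-sample loss $\tfrac12\|\theta - y\|^2$ with $y \sim \mathcal{N}(\mu, I_d)$. The summary statistic $R_n = \|\theta_n - \mu\|^2/d$ then concentrates, by self-averaging over coordinates, around a deterministic recursion, with fluctuations vanishing as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Online SGD for Gaussian mean estimation: theta <- theta - delta*(theta - y).
# The summary statistic R_n = ||theta_n - mu||^2 / d satisfies, in expectation,
# the deterministic ("ballistic") recursion R <- (1-delta)^2 R + delta^2,
# and the simulated R_n concentrates around it in high dimension.
d, delta, n_steps = 2_000, 0.01, 500
mu = np.zeros(d)
theta = np.ones(d)                        # R_0 = ||theta - mu||^2 / d = 1

R_sim, R_det, R = [], [], 1.0
for _ in range(n_steps):
    y = mu + rng.standard_normal(d)
    theta = theta - delta * (theta - y)   # SGD step on 0.5 * ||theta - y||^2
    R_sim.append(float((theta - mu) @ (theta - mu)) / d)
    R = (1 - delta) ** 2 * R + delta ** 2 # deterministic recursion
    R_det.append(R)

print(max(abs(a - b) for a, b in zip(R_sim, R_det)))  # small: concentration
```

On the time scale $t = n\delta$ the recursion discretizes $\dot R = -2R + \delta$, so the trajectory decays exponentially toward a noise floor of order $\delta/2$; in richer models the same concentration underlies the ODE limits cited above.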
4. Extensions: Averaging, Manifolds, and Nonstandard Noise
SGD limit theory extends in several crucial directions:
- Polyak–Ruppert averaging improves asymptotic efficiency, with CLTs guaranteeing optimal limiting variances in both isolated minimizer and stable manifold settings. When the limiting set is a manifold, only the normal directions admit classical CLT scaling, while tangential fluctuations vanish under standard normalization (Dereich et al., 2019).
- For non-quadratic or nonlinear stochastic approximation recursions, the correct limiting distribution and normalization can differ. In certain cases, one must identify the scaling exponent under which the rescaled stationary deviations converge to a non-Gaussian limit, dictated by a functional equation and a matching SDE discretization (Chen et al., 2021).
- Infinite-variance gradient noise breaks the classical CLT: with regularly varying noise tails of index $\alpha \in (1, 2)$, SGD fluctuations scale as $\gamma_n^{(\alpha-1)/\alpha}$ (recovering the $\gamma_n^{1/2}$ rate at $\alpha = 2$), and the limiting law is the stationary measure of a stable OU process driven by a multivariate Lévy process, with characteristic function given by the Lévy–Khintchine formula (Blanchet et al., 2024). Examples in linear and logistic regression confirm this non-Gaussian heavy-tailed limit.
- Momentum and Nesterov SGD variants satisfy CLTs under analogous Lyapunov and smoothness conditions, with Gaussian limits whose covariances solve block Lyapunov equations. However, time-averaged iterates enjoy CLTs only in linear or "small-remainder" cases, and asymptotic normality may fail for generic nonlinearities (Li et al., 2022).
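The heavy-tailed scaling can be probed empirically. The sketch below is a hypothetical setup (one-dimensional quadratic with symmetric Pareto gradient noise of tail index $\alpha = 1.5$; not the cited paper's experiments): it estimates how the typical stationary deviation shrinks with the constant step-size, recovering an exponent near $(\alpha-1)/\alpha = 1/3$ rather than the Gaussian $1/2$.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, reps = 1.5, 2_000   # noise tail index in (1, 2); independent runs

def stationary_abs_median(gamma, rng):
    """Median of |theta| at (approximate) stationarity for SGD on
    F(x) = x^2/2 with symmetric Pareto(alpha) gradient noise."""
    theta = np.zeros(reps)
    for _ in range(int(20 / gamma)):      # ~20 relaxation times
        u = rng.random(reps)
        xi = np.sign(rng.random(reps) - 0.5) * u ** (-1 / alpha)  # P(|xi|>x)=x^-alpha
        theta = (1 - gamma) * theta + gamma * xi
    return np.median(np.abs(theta))

g1, g2 = 0.1, 0.001
m1, m2 = stationary_abs_median(g1, rng), stationary_abs_median(g2, rng)
exponent = np.log(m1 / m2) / np.log(g1 / g2)
print(exponent)   # close to (alpha-1)/alpha = 1/3, not the Gaussian 1/2
```

Using the median rather than the standard deviation is deliberate: the variance of the iterates is infinite here, so quantile-based scale estimates are the stable way to read off the fluctuation exponent.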
5. Diffusion Approximations and SDE Embeddings
In the vanishing step-size limit ($\gamma \to 0$), SGD iterates can be consistently approximated by solutions to stochastic differential equations:
- Under minimal Lipschitz and moment conditions, the continuous-time interpolation of the iterates converges weakly to the SDE $d\Theta_t = -\nabla F(\Theta_t)\,dt + \sqrt{\gamma}\,\Sigma(\Theta_t)^{1/2}\,dB_t$, where $\Sigma(\theta)$ denotes the covariance of the per-sample gradient at $\theta$ (Lanconelli et al., 2022).
- The diffusion coefficient explicitly encodes the variance of per-sample gradients, and the convergence is justified via Stroock–Varadhan conditions and martingale central limit methods. Gaussianity of the limiting noise is a consequence of aggregation of mean-zero increments with finite moments.
In high dimensions, these diffusion limits are valid for fixed finite-dimensional projections of the parameter vector over finite time horizons, provided the underlying regularity and localizability conditions are met (Arous et al., 2022).
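For a quadratic objective the limiting SDE is an OU process with explicit Gaussian marginals, so the weak-error content of the diffusion approximation can be checked in closed form. A minimal sketch (all constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# SGD on F(x) = h x^2 / 2 with per-sample gradient noise variance s2 and
# constant step gamma, run to horizon T. The interpolated process is close
# in distribution to the OU SDE dX = -h X dt + sqrt(gamma * s2) dB, whose
# solution from x0 is Gaussian with explicit mean and variance.
h, s2, gamma, T, x0 = 1.0, 2.0, 0.005, 2.0, 1.0
x = np.full(20_000, x0)                   # 20000 independent runs
for _ in range(int(T / gamma)):
    x -= gamma * (h * x + np.sqrt(s2) * rng.standard_normal(x.shape))

sde_mean = x0 * np.exp(-h * T)
sde_var = gamma * s2 / (2 * h) * (1 - np.exp(-2 * h * T))
print(x.mean(), sde_mean)                 # ~0.135
print(x.var(), sde_var)                   # ~0.0049
```

The $\sqrt{\gamma}$ factor in the diffusion coefficient is what makes the noise term vanish in the limit, leaving the gradient flow as the leading-order dynamics with Gaussian corrections of size $\sqrt{\gamma}$.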
6. Practical Implications and Algorithmic Consequences
The rigorous limit theorems for SGD yield several actionable principles:
- The step-size exponent $\alpha$ in schedules $\gamma_n = \gamma_0 n^{-\alpha}$ must exceed $1/2$ for optimal convergence rates in the mean-square error and for Gaussian fluctuation limits (Sirignano et al., 2017, Flamand et al., 17 Feb 2026).
- Variance reduction via mini-batch design or non-uniform sampling is quantifiable, allowing the practitioner to design SGD schemes with minimal asymptotic variance under computational constraints (Clémençon et al., 2015).
- Polyak–Ruppert averaging is recommended for optimal estimation efficiency, particularly in the presence of non-isolated minima or low-curvature manifolds (Dereich et al., 2019).
- For models suffering from heavy-tailed data or gradient noise, the practitioner must anticipate and accommodate stable (non-Gaussian) fluctuations, which can dramatically affect uncertainty quantification and confidence intervals (Blanchet et al., 2024).
- Functional CLTs enable the construction of confidence bands over parameter trajectories, providing a nuanced alternative to final-iterate intervals, and inform robust step-size tuning strategies for both statistical and computational optimality (Wang et al., 21 Jan 2025, Flamand et al., 17 Feb 2026).
- In high-dimensional neural and mixture models, the scaling regime ($\delta \ll 1/d$ vs $\delta \sim 1/d$) critically determines whether the learning process behaves deterministically (ballistic regime) or is dominated by stochasticity (diffusive regime), which in turn influences escape rates, convergence to optimal or spurious solutions, and the value of overparametrization (Arous et al., 2022, Gheissari et al., 15 Dec 2025, Rangriz, 4 Nov 2025).
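The efficiency gain from Polyak–Ruppert averaging is visible directly in a toy quadratic problem: the last iterate carries variance of order $\gamma_n S/(2h)$, while the averaged iterate attains the optimal $S/(h^2 n)$. A minimal sketch (illustrative constants, one-dimensional quadratic):

```python
import numpy as np

rng = np.random.default_rng(4)

# SGD on F(x) = h x^2 / 2 with gradient-noise variance s2 and steps
# gamma_k = k^(-2/3); compare the last iterate with the running average.
h, s2 = 2.0, 1.0
n, reps = 20_000, 2_000                   # steps, independent runs

x = np.ones(reps)
running_sum = np.zeros(reps)
for k in range(1, n + 1):
    gamma = k ** (-2 / 3)
    x -= gamma * (h * x + np.sqrt(s2) * rng.standard_normal(reps))
    running_sum += x
x_bar = running_sum / n                   # Polyak-Ruppert averaged iterate

print(x.var())      # last iterate:  ~ gamma_n * s2 / (2 h)
print(x_bar.var())  # averaged:      ~ s2 / (h^2 n), the optimal variance
```

The averaged iterate's variance matches the efficient $H^{-1} S H^{-1}/n$ rate without any knowledge of the Hessian, which is the practical appeal of averaging over careful step-size tuning.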
Table: Canonical Regimes and Limit Laws for SGD
| Regime | Scaling/Normalization | Limit Process |
|---|---|---|
| Classical SGD (finite var.) | $\gamma_n^{-1/2}(\theta_n - \theta^*)$, $\gamma_n \to 0$ | $\mathcal{N}(0, \Sigma)$ (CLT) |
| Polyak–Ruppert averaging | $\sqrt{n}(\bar{\theta}_n - \theta^*)$, averages $\bar{\theta}_n = n^{-1}\sum_{k \le n} \theta_k$ | $\mathcal{N}(0, H^{-1} S H^{-1})$ (optimal) |
| Infinite-variance noise | $\gamma_n^{-(\alpha-1)/\alpha}(\theta_n - \theta^*)$ | Stable OU driven by Lévy process |
| Constant step-size $\gamma \to 0$ | $\gamma^{-1/2}(\theta_n - \theta^*)$ | Stationary Gaussian (Lyapunov) |
| High-dim subcritical ($\delta \ll 1/d$) | — | ODE (gradient flow) |
| High-dim critical ($\delta \sim 1/d$) | — | SDE/OU for summary statistics |
This synthesis demonstrates that limit theorems for SGD deliver a rigorous probabilistic framework for precision and uncertainty quantification in modern optimization, intertwining algorithm design, statistical efficiency, and the geometry of learning in both classical and overparametrized high-dimensional settings.