
Limit Theorems for Stochastic Gradient Descent

Updated 18 February 2026
  • The paper establishes that limit theorems for SGD—including LLNs, CLTs, and diffusion approximations—precisely characterize convergence behavior and uncertainty quantification.
  • It demonstrates that fluctuation distributions vary with noise properties, step-size schedules, and high-dimensional scaling, impacting both variance reduction and stability.
  • The results offer actionable insights for optimizing step-size, employing Polyak–Ruppert averaging, and tailoring SGD for deterministic or stochastic regimes in large-scale models.

Stochastic Gradient Descent (SGD) is a foundational paradigm for optimization and estimation in high-dimensional and large-scale learning and statistical tasks. The characterization of its long-time and large-sample behavior is governed by a suite of limit theorems, including laws of large numbers (LLNs), central limit theorems (CLTs), functional invariance principles, and diffusion approximations. These results quantify both the rates and the distributions of fluctuations around minimizers, under varying assumptions on noise, step-size schedules, and high-dimensional limits.

1. Classical Central Limit Theorems for SGD

SGD recursions targeting $\min_\theta f(\theta) = \mathbb{E}[F(\theta,Z)]$ with i.i.d. data can be written as $\theta_{t+1} = \theta_t - \gamma_t(\nabla f(\theta_t) + \zeta_t)$, with an appropriately regular objective and martingale-difference noise $\zeta_t$. Under strong convexity, Lipschitz continuity, and vanishing step-size schedules $\gamma_t = t^{-\alpha}$, $\alpha \in (1/2,1]$, classical limit theorems assert:

  • Almost sure convergence to the population minimizer $\theta^*$ when $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$ (Clémençon et al., 2015).
  • A central limit theorem for the rescaled averaged errors: $\sqrt{n}(\bar{\theta}_n - \theta^*) \Rightarrow N(0, H^{-1}\Sigma H^{-1})$, where $H = \nabla^2 f(\theta^*)$ and $\Sigma$ is the asymptotic covariance of the gradient noise (Anastasiou et al., 2019).
  • For averaging schemes (Polyak–Ruppert), normality holds with explicit $O(d^2/\sqrt{n})$ Wasserstein/Kolmogorov convergence rates (Anastasiou et al., 2019).

These classical CLTs extend to SGD under unequal-probability sampling (Horvitz–Thompson SGD), where the limit distribution is again Gaussian, with a covariance solving a Lyapunov equation that incorporates the inclusion probabilities, together with a precise variance-reduction principle (Clémençon et al., 2015).
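As a concrete illustration of the sandwich covariance, the following minimal sketch (illustrative only, not reproducing any cited paper's experiments; all parameter values are hypothetical) runs Polyak–Ruppert-averaged SGD on a one-dimensional quadratic, where $H = h$ and $\Sigma = \sigma^2$, and compares the empirical variance of $\sqrt{n}(\bar\theta_n - \theta^*)$ with the predicted $H^{-1}\Sigma H^{-1} = \sigma^2/h^2$:

```python
import numpy as np

# Illustrative sketch: Polyak-Ruppert averaged SGD on the 1-D quadratic
# f(theta) = (h/2) * (theta - theta_star)^2 with additive Gaussian gradient
# noise of variance sigma2.  The CLT predicts
#     sqrt(n) * (theta_bar_n - theta_star)  =>  N(0, sigma2 / h^2).
rng = np.random.default_rng(0)
h, theta_star, sigma2 = 1.0, 0.0, 1.0
n, reps = 20_000, 2_000

theta = np.full(reps, theta_star)   # start at the minimizer to isolate fluctuations
theta_bar = np.zeros(reps)
for t in range(1, n + 1):
    gamma = t ** -0.75              # gamma_t = t^{-alpha}, alpha in (1/2, 1]
    grad = h * (theta - theta_star) + np.sqrt(sigma2) * rng.standard_normal(reps)
    theta -= gamma * grad
    theta_bar += (theta - theta_bar) / t    # running average of the iterates

scaled = np.sqrt(n) * (theta_bar - theta_star)
empirical_var = scaled.var()
predicted_var = sigma2 / h**2       # the sandwich H^{-1} Sigma H^{-1} in 1-D
print(empirical_var, predicted_var)
```

Starting at the minimizer isolates the stationary fluctuations; with a transient initialization the averaged iterate also carries a bias term that vanishes only asymptotically.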

2. Functional and Pathwise Invariance Principles

Beyond pointwise CLTs, functional central limit theorems (FCLTs) describe the trajectory-level fluctuations of SGD on path spaces:

  • For convex objectives with (possibly only local) smoothness at the minimizer, suitably rescaled trajectories $Y^n_t$ converge in $C^0((0,\infty),\mathbb{R}^d)$ to the unique diffusion $dY_t = t^{-1}(I_d - \delta H)Y_t\,dt + \delta\,\Gamma^{1/2}\,dB_t$, encoding time-inhomogeneous correlations (Flamand et al., 17 Feb 2026).
  • The Ornstein–Uhlenbeck (OU) process governs the limit, with covariance given by integral representations involving $H$ and the noise structure.
  • Analogous FCLTs exist for SGLD in stationary or mixing environments, even without Markovianity of the data stream, yielding functional convergence in Skorokhod space (Lovas et al., 2022).
  • Explicit non-asymptotic error bounds between the discrete SGD/SGLD process and the limiting OU can be attained via Stein's method for exchangeable pairs, with $O(\sqrt{h\log(1/h)})$ accuracy in the univariate case and functional CLTs for time-averaged iterates (Wang et al., 21 Jan 2025).

These pathwise results enable the construction of temporal confidence regions and a more refined understanding of the influence of noise correlations and step-size choice on the entire SGD trajectory.
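For intuition about the limiting diffusion, the sketch below (an illustrative Euler–Maruyama discretization in one dimension, with hypothetical values $\delta H = 2$ and $b = \delta\,\Gamma^{1/2} = 2$) integrates the time-inhomogeneous OU dynamics and checks the simulated second moment against the closed-form solution of the variance ODE $v'(t) = 2t^{-1}(1-\delta H)\,v(t) + b^2$, which from $v(1)=0$ gives $v(t) = (b^2/3)(t - t^{-2})$:

```python
import numpy as np

# Illustrative Euler-Maruyama simulation of the scalar time-inhomogeneous
# OU limit  dY_t = t^{-1} (1 - delta*H) Y_t dt + b dB_t  on [1, 10],
# with hypothetical parameters delta*H = 2 and b = 2.  The variance ODE
#     v'(t) = -2 v(t)/t + b^2,  v(1) = 0,
# has the closed-form solution v(t) = (b^2 / 3) * (t - t^{-2}).
rng = np.random.default_rng(1)
delta_H, b = 2.0, 2.0
t0, t1, dt, paths = 1.0, 10.0, 2e-3, 5_000

Y = np.zeros(paths)
nsteps = int(round((t1 - t0) / dt))
for k in range(nsteps):
    t = t0 + k * dt
    drift = (1.0 - delta_H) / t * Y           # time-inhomogeneous linear drift
    Y += drift * dt + b * np.sqrt(dt) * rng.standard_normal(paths)

empirical_var = Y.var()
predicted_var = (b**2 / 3.0) * (t1 - t1**-2)  # about 13.32
print(empirical_var, predicted_var)
```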

3. High-Dimensional and Critical Scaling Regimes

In modern regimes where both data dimension and sample size diverge, high-dimensional limit theorems track finite collections of summary statistics under SGD:

  • For single- and multi-layer neural networks and mixture estimation, if the step size satisfies $\delta \ll 1/d$, the dynamics of key summary statistics follow a deterministic ODE (the "ballistic" limit), which is universal under broad data distributions with matching moments and delocalized initialization (Arous et al., 2022, Gheissari et al., 15 Dec 2025).
  • At the critical scaling $\delta = c/d$, stochastic fluctuations (the diffusive regime) become significant. The rescaled process converges to a system of SDEs, typically linear around fixed points and often reducing to Ornstein–Uhlenbeck processes. This regime has downstream effects:
    • The phase diagram and the basin structure of SGD depend on both drift and covariance corrections.
    • Fluctuations around critical points can induce noise-driven escapes or convergence to suboptimal manifolds (Arous et al., 2022, Rangriz, 4 Nov 2025).
  • Universality holds for ODE fluctuations under suitable moment and initialization conditions, but may break for SDE fluctuations or poorly chosen initializations, as higher-order moment effects are amplified (Gheissari et al., 15 Dec 2025).

High-dimensional theorems reveal phenomena such as multimodal convergence timescales, probabilistic trapping in spurious optima, and the role of overparametrization in improving convergence probabilities (Arous et al., 2022).
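The ballistic/diffusive dichotomy is visible even in a toy problem. The sketch below (illustrative; online SGD for high-dimensional Gaussian mean estimation, a stand-in for the models in the cited papers) tracks the summary statistic $u_t = \|\theta_t - \theta^*\|^2/d$ at a fixed effective time $s = \delta t$ under a subcritical step size $\delta \ll 1/d$ and a critical step size $\delta = c/d$; run-to-run fluctuations of the summary are an order of magnitude larger at the critical scaling:

```python
import numpy as np

# Illustrative sketch: online SGD for d-dimensional Gaussian mean estimation,
#   theta_{t+1} = theta_t - delta * ((theta_t - theta_star) + z_t),  z_t ~ N(0, I_d).
# We track the summary statistic u_t = |theta_t - theta_star|^2 / d at the
# common effective time s = delta * t = 1 in two step-size regimes.
rng = np.random.default_rng(2)
d, runs, s_final = 200, 200, 1.0

def run_sgd(delta):
    steps = int(round(s_final / delta))
    err = np.ones((runs, d))            # theta_0 - theta_star, so u_0 = 1
    for _ in range(steps):
        err = (1.0 - delta) * err - delta * rng.standard_normal((runs, d))
    return (err**2).sum(axis=1) / d     # u at effective time s_final

u_sub = run_sgd(0.1 / d)    # subcritical delta << 1/d: near-deterministic ODE limit
u_crit = run_sgd(5.0 / d)   # critical delta = c/d: O(1)-visible diffusive fluctuations

print(u_sub.std(), u_crit.std())   # fluctuations much larger at critical scaling
```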

4. Extensions: Averaging, Manifolds, and Nonstandard Noise

SGD limit theory extends in several crucial directions:

  • Polyak–Ruppert averaging improves asymptotic efficiency, with CLTs guaranteeing the optimal limiting variance $H^{-1}\Sigma H^{-1}$ in both isolated-minimizer and stable-manifold settings. When the limiting set is a manifold, only the normal directions admit classical CLT scaling, while tangential fluctuations vanish under the standard normalization (Dereich et al., 2019).
  • For non-quadratic or nonlinear stochastic approximation recursions, the correct limiting distribution and normalization can differ. In certain cases one must identify the scaling $g(\alpha)$ such that the rescaled stationary deviations converge to a non-Gaussian limit, dictated by a functional equation and a matching SDE discretization (Chen et al., 2021).
  • Infinite-variance gradient noise invalidates the classical CLT: with regularly varying tails of index $\alpha\in(1,2)$, SGD scales as $n^{1-1/\alpha}/b_1(n)$, and the limiting law is the stationary measure of a stable OU process driven by a multivariate Lévy process, with characteristic function given by the Lévy–Khintchine formula (Blanchet et al., 2024). Examples in linear and logistic regression confirm this non-Gaussian heavy-tailed limit.
  • Classical CLTs for momentum and Nesterov SGD variants require analogous Lyapunov and smoothness conditions, and converge to Gaussian limits with covariance given by block Lyapunov equations. However, time-averaged iterates enjoy CLTs only in linear or "small-remainder" cases, and may fail for generic nonlinearities (Li et al., 2022).
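To visualize the heavy-tailed regime, the sketch below (illustrative; Student-$t$ noise with $1.5$ degrees of freedom serves as a stand-in for regularly varying gradient noise of index $\alpha = 1.5$) compares SGD error distributions under Gaussian versus infinite-variance noise; the ratio of an extreme quantile to the median is far larger in the heavy-tailed case, consistent with a stable rather than Gaussian limit:

```python
import numpy as np

# Illustrative comparison: SGD on f(theta) = theta^2 / 2 with gamma_t = t^{-0.75},
# driven by (a) Gaussian noise and (b) Student-t noise with df = 1.5, whose tails
# are regularly varying with index alpha = 1.5 (infinite variance).
rng = np.random.default_rng(3)
n, reps = 2_000, 4_000

def final_errors(noise_sampler):
    theta = np.zeros(reps)                    # minimizer is theta* = 0
    for t in range(1, n + 1):
        gamma = t ** -0.75
        theta -= gamma * (theta + noise_sampler(reps))
    return np.abs(theta)

err_gauss = final_errors(lambda k: rng.standard_normal(k))
err_heavy = final_errors(lambda k: rng.standard_t(1.5, size=k))

# Tail-to-bulk quantile ratio: modest for Gaussian-limit fluctuations,
# large under heavy-tailed (stable-limit) fluctuations.
ratio_gauss = np.quantile(err_gauss, 0.995) / np.median(err_gauss)
ratio_heavy = np.quantile(err_heavy, 0.995) / np.median(err_heavy)
print(ratio_gauss, ratio_heavy)
```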

5. Diffusion Approximations and SDE Embeddings

In the vanishing step-size limit ($\eta\to0$), SGD iterates can be consistently approximated by solutions to stochastic differential equations:

  • Under minimal Lipschitz and moment conditions, the continuous interpolated process $\{X^{(\eta)}_t\}$ converges to the SDE $dX_t = -\nabla f(X_t)\,dt + \sqrt{\bar\eta\,\Sigma(X_t)}\,dB_t$ (Lanconelli et al., 2022).
  • The diffusion coefficient $\Sigma(x)$ explicitly encodes the variance of per-sample gradients, and the convergence is justified via Stroock–Varadhan conditions and martingale central limit methods. Gaussianity of the limiting noise is a consequence of the aggregation of mean-zero increments with finite moments.

In high dimensions, these diffusion limits are valid for fixed finite-dimensional projections of the parameter vector over finite time horizons, provided the underlying regularity and localizability conditions are met (Arous et al., 2022).
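As a numerical sanity check on the diffusion approximation, the sketch below (illustrative, one-dimensional, with hypothetical parameters) compares long-run constant-step SGD on $f(x) = x^2/2$ with an Euler–Maruyama discretization of the limiting SDE $dX_t = -X_t\,dt + \sqrt{\eta\sigma^2}\,dB_t$; both settle near the OU stationary variance $\eta\sigma^2/2$:

```python
import numpy as np

# Illustrative check of the small-step diffusion approximation on
# f(x) = x^2 / 2 (so grad f(x) = x) with gradient-noise variance sigma2:
#   SGD:  x_{k+1} = x_k - eta * (x_k + xi_k),    xi_k ~ N(0, sigma2)
#   SDE:  dX_t    = -X_t dt + sqrt(eta * sigma2) dB_t   (OU process)
# Both have stationary variance close to eta * sigma2 / 2 for small eta.
rng = np.random.default_rng(4)
eta, sigma2, chains, T = 0.05, 1.0, 10_000, 100.0

x_sgd = np.zeros(chains)
for _ in range(int(round(T / eta))):        # constant-step SGD over horizon T
    x_sgd -= eta * (x_sgd + np.sqrt(sigma2) * rng.standard_normal(chains))

dt = 0.02
x_sde = np.zeros(chains)
for _ in range(int(round(T / dt))):         # Euler-Maruyama over the same horizon
    x_sde += -x_sde * dt + np.sqrt(eta * sigma2 * dt) * rng.standard_normal(chains)

var_sgd, var_sde = x_sgd.var(), x_sde.var()
predicted = eta * sigma2 / 2.0              # OU stationary variance, 0.025
print(var_sgd, var_sde, predicted)
```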

6. Practical Implications and Algorithmic Consequences

The rigorous limit theorems for SGD yield several actionable principles:

  • The step-size exponent must exceed $1/2$ for optimal $O(t^{-1/2})$ convergence rates in the mean-square error and for Gaussian fluctuation limits (Sirignano et al., 2017, Flamand et al., 17 Feb 2026).
  • Variance reduction via mini-batch design or non-uniform sampling is quantifiable, allowing the practitioner to design SGD schemes with minimal asymptotic variance under computational constraints (Clémençon et al., 2015).
  • Polyak–Ruppert averaging is recommended for optimal estimation efficiency, particularly in the presence of non-isolated minima or low-curvature manifolds (Dereich et al., 2019).
  • For models suffering from heavy-tailed data or gradient noise, the practitioner must anticipate and accommodate stable (non-Gaussian) fluctuations, which can dramatically affect uncertainty quantification and confidence intervals (Blanchet et al., 2024).
  • Functional CLTs enable the construction of confidence bands over parameter trajectories, providing a nuanced alternative to final-iterate intervals, and inform robust step-size tuning strategies for both statistical and computational optimality (Wang et al., 21 Jan 2025, Flamand et al., 17 Feb 2026).
  • In high-dimensional neural and mixture models, the scaling regime ($\delta$ vs. $d$) critically determines whether the learning process behaves deterministically (ballistic regime) or is dominated by stochasticity (diffusive regime), which in turn influences escape rates, convergence to optimal or spurious solutions, and the value of overparametrization (Arous et al., 2022, Gheissari et al., 15 Dec 2025, Rangriz, 4 Nov 2025).
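The first and third principles above can be checked numerically. The sketch below (illustrative, one-dimensional quadratic with hypothetical parameters) contrasts the mean-squared error of the last iterate, which tracks the step size $\gamma_n = n^{-\alpha}$, with that of the Polyak–Ruppert average, which attains the optimal $O(1/n)$ rate:

```python
import numpy as np

# Illustrative rate comparison on f(theta) = theta^2 / 2 with unit-variance
# Gaussian gradient noise and gamma_t = t^{-0.75}:
#   last iterate:     MSE ~ gamma_n / 2   (step-size-limited)
#   averaged iterate: MSE ~ 1 / n         (optimal, CLT variance sigma^2 / h^2)
rng = np.random.default_rng(5)
n, reps, alpha = 10_000, 2_000, 0.75

theta = np.zeros(reps)          # start at the minimizer theta* = 0
theta_bar = np.zeros(reps)
for t in range(1, n + 1):
    gamma = t ** -alpha
    theta -= gamma * (theta + rng.standard_normal(reps))
    theta_bar += (theta - theta_bar) / t    # running Polyak-Ruppert average

mse_last = np.mean(theta**2)
mse_avg = np.mean(theta_bar**2)
print(mse_last, n ** -alpha / 2)    # last iterate tracks gamma_n / 2
print(mse_avg, 1.0 / n)             # average attains the O(1/n) rate
```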

Table: Canonical Regimes and Limit Laws for SGD

| Regime | Scaling/Normalization | Limit Process |
|---|---|---|
| Classical SGD (finite variance) | $\sqrt{n}$, $\sqrt{t}$ | $N(0, H^{-1}\Sigma H^{-1})$ (CLT) |
| Polyak–Ruppert averaging | $\sqrt{n}$, averages | $N(0, H^{-1}\Sigma H^{-1})$ (optimal) |
| Infinite-variance noise | $n^{1-1/\alpha}$ | Stable OU driven by a Lévy process |
| Constant step size (vanishing) | $1/\sqrt{\alpha}$ | Stationary Gaussian (Lyapunov) |
| High-dimensional, subcritical | (deterministic) | ODE (gradient flow) |
| High-dimensional, critical | $1/\sqrt{d}$ | SDE/OU for summary statistics |

This synthesis demonstrates that limit theorems for SGD deliver a rigorous probabilistic framework for precision and uncertainty quantification in modern optimization, intertwining algorithm design, statistical efficiency, and the geometry of learning in both classical and overparametrized high-dimensional settings.
