
Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

Published 4 Nov 2025 in stat.ML, cs.LG, math.PR, math.ST, and stat.TH | (2511.02258v1)

Abstract: This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD) for single-layer networks. Building on the seminal work of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits of SGD corresponding to the gradient flow of the population loss, we focus on the critical scaling regime of the step size. Below this critical scale, the effective dynamics are governed by ballistic (ODE) limits, but at the critical scale, a new correction term appears that changes the phase diagram. In this regime, near the fixed points, the corresponding diffusive (SDE) limits of the effective dynamics reduce to an Ornstein-Uhlenbeck process under certain conditions. These results highlight how the information exponent controls sample complexity and illustrate the limitations of deterministic scaling limits in capturing the stochastic fluctuations of high-dimensional learning dynamics.

Summary

  • The paper rigorously establishes scaling limits for SGD in high dimensions by linking discrete updates to deterministic ODE and stochastic SDE models.
  • The paper demonstrates that near critical points, stochastic fluctuations follow an Ornstein–Uhlenbeck process, highlighting the necessity of diffusion corrections.
  • The paper shows that the information exponent crucially influences sample complexity, guiding optimal step-size and activation function design for learning.

Introduction and Context

This paper rigorously analyzes the scaling limits of online stochastic gradient descent (SGD) in high-dimensional single-layer neural networks, focusing on the critical regime where both the sample size and parameter dimension diverge. Building on the dynamical mean-field theory (DMFT) framework and the concept of order parameters introduced by Saad and Solla, the work advances the understanding of SGD dynamics beyond deterministic (ODE) limits by characterizing the stochastic fluctuations that arise at critical step-size scaling. The analysis is performed in the teacher-student setting, with particular attention to activation functions whose Hermite expansion yields an information exponent strictly greater than two, corresponding to challenging learning regimes with quasi-linear or polynomial sample complexity.

Mathematical Framework and Scaling Regimes

The SGD update is considered in the form:

$$X_{k+1} = X_k - \delta_N \nabla L_N(X_k; Y_{k+1}),$$

where $X_k \in \mathbb{R}^N$ and $\delta_N = c_\delta/N$ is the critical step-size scaling. The loss function $L_N$ is quadratic, and the data are generated via a single-index model with Gaussian features and additive noise. The analysis tracks the evolution of the summary statistics $u_N(x) = (m(x), r_\perp^2(x))$, where $m(x) = \langle x, X^* \rangle$ and $r_\perp^2(x) = \|x\|^2 - m^2(x)$, which fully characterize the population loss in this setting.
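The update rule and order parameters can be sketched numerically. The following minimal simulation runs online SGD at the critical step size $\delta_N = c_\delta/N$ on a single-index model and tracks $(m, r_\perp^2)$; the tanh activation, unit-norm teacher, noise level, and horizon are illustrative choices of ours, not specifics from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 500                       # parameter dimension
c_delta = 1.0                 # step-size constant (illustrative)
delta = c_delta / N           # critical scaling delta_N = c_delta / N
noise_std = 0.1               # additive label noise (illustrative)

f = np.tanh                   # illustrative activation, not from the paper
fp = lambda z: 1.0 - np.tanh(z) ** 2   # f'

x_star = np.zeros(N)          # teacher direction, unit norm
x_star[0] = 1.0
x = rng.standard_normal(N) / np.sqrt(N)  # random init, ||x|| ~ 1

def summary_stats(x):
    """Order parameters m(x) = <x, X*> and r_perp^2(x) = ||x||^2 - m(x)^2."""
    m = x @ x_star
    return m, x @ x - m ** 2

for k in range(20 * N):       # one fresh Gaussian sample per step (online SGD)
    a = rng.standard_normal(N)
    y = f(a @ x_star) + noise_std * rng.standard_normal()
    pre = a @ x
    # gradient of the quadratic loss (f(<a, x>) - y)^2 in x
    x = x - delta * 2.0 * (f(pre) - y) * fp(pre) * a

m, r2 = summary_stats(x)
print(m, r2)
```

Because each step uses a fresh sample, $k$ steps at step size $c_\delta/N$ correspond to rescaled time $t = k/N$, which is the clock on which the limiting ODE/SDE below run.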

The key regularity conditions (localizability and asymptotic closability) ensure that the summary statistics admit well-defined scaling limits. The population loss $\Phi(x)$ is assumed to have information exponent $k \geq 2$, determined by the Hermite expansion of the activation function $f$.

Ballistic (ODE) and Diffusive (SDE) Limits

Ballistic Regime

For $\delta_N = 1/N$ and activation functions with information exponent at least two, the summary statistics $(m, r_\perp^2)$ converge to the solution of a deterministic ODE system:

$$\begin{aligned} \frac{dm}{dt} &= -2\,\mathbb{E}\!\left[a_1 f'(a_1 m + a_2 r_\perp)\big(f(a_1 m + a_2 r_\perp) - f(a_1)\big)\right], \\ \frac{dr_\perp^2}{dt} &= -4\,\mathbb{E}\!\left[a_2 r_\perp f'(a_1 m + a_2 r_\perp)\big(f(a_1 m + a_2 r_\perp) - f(a_1)\big)\right] \\ &\quad + 4\,\mathbb{E}\!\left[f'^2(a_1 m + a_2 r_\perp)\big((f(a_1 m + a_2 r_\perp) - f(a_1))^2 + C_\epsilon\big)\right]. \end{aligned}$$

Random initialization from a Gaussian distribution leads to $m(t) = 0$ and a reduced ODE for $r_\perp^2(t)$, which can be explicitly characterized using Hermite polynomial expansions and Stein's lemma.
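As a rough numerical companion, the ODE system can be integrated by forward Euler, with the Gaussian expectations over $(a_1, a_2)$ estimated by Monte Carlo. The activation, noise constant $C_\epsilon$, initialization, and horizon below are our illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

f  = np.tanh                             # illustrative activation
fp = lambda z: 1.0 - np.tanh(z) ** 2     # f'
C_eps = 0.01                             # noise constant (illustrative)

def ballistic_drift(m, r2, n_mc=50_000):
    """Monte Carlo estimate of the right-hand side of the ODE system,
    with (a1, a2) i.i.d. standard Gaussian."""
    r = np.sqrt(max(r2, 0.0))
    a1, a2 = rng.standard_normal((2, n_mc))
    pre = a1 * m + a2 * r
    err = f(pre) - f(a1)
    dm  = -2.0 * np.mean(a1 * fp(pre) * err)
    dr2 = (-4.0 * np.mean(a2 * r * fp(pre) * err)
           + 4.0 * np.mean(fp(pre) ** 2 * (err ** 2 + C_eps)))
    return dm, dr2

# forward Euler in the rescaled time t = k / N
m, r2, dt = 0.3, 1.0, 0.02
for _ in range(100):
    dm, dr2 = ballistic_drift(m, r2)
    m, r2 = m + dt * dm, r2 + dt * dr2
print(m, r2)
```

Starting from $m > 0$ sidesteps the saddle at $m = 0$, which is exactly the regime where this ballistic description breaks down and the diffusive correction below takes over.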

Diffusive Regime and Fluctuations

In microscopic neighborhoods of the saddle set $m = 0$, the ballistic approximation fails, and the effective dynamics are governed by an SDE. Rescaling $m$ as $\tilde{m} = \sqrt{N}\, m$, the limiting process for $(\tilde{m}, r_\perp^2)$ is:

$$\begin{aligned} d\tilde{m} &= -2\tilde{m}\, \mathbb{E}\!\left[f'^2(a_2 r_\perp) + f(a_2 r_\perp) f''(a_2 r_\perp)\right] dt \\ &\quad + 2\sqrt{\mathbb{E}\!\left[f'^2(a_2 r_\perp) f^2(a_2 r_\perp)\right] + \mathbb{E}\!\left[f^2(a_2 r_\perp)\right]\left(\|f\|^2 + 2\|f'\|^2 + 2\langle f, f''\rangle\right)}\, dB_t, \\ \frac{dr_\perp^2}{dt} &= 4\,\mathbb{E}\!\left[f'^2(a_2 r_\perp)\right]\left(C_\epsilon + \|f\|^2 - r_\perp^2\right) + 4\,\mathbb{E}\!\left[f'^2(a_2 r_\perp) f^2(a_2 r_\perp)\right] - 4 r_\perp^2\, \mathbb{E}\!\left[f''(a_2 r_\perp) f(a_2 r_\perp)\right]. \end{aligned}$$

At the fixed point $r_\perp^* = r_\perp^2(t^*)$, the dynamics of $\tilde{m}$ reduce to a mean-reverting Ornstein–Uhlenbeck process, with explicit drift and volatility terms determined by the activation function and noise statistics.
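The fixed-point behavior can be illustrated with a generic Ornstein–Uhlenbeck simulation via Euler–Maruyama. The drift $\theta$ and volatility $\sigma$ below are placeholders, not the paper's activation-dependent coefficients; the point is the mean reversion and the stationary variance $\sigma^2/(2\theta)$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Generic OU process d m~ = -theta * m~ dt + sigma dB_t; theta and sigma
# stand in for the activation-dependent drift/volatility at the fixed point.
theta, sigma = 2.0, 0.5
dt, n_steps, n_paths = 1e-3, 10_000, 2_000

x = np.ones(n_paths)          # start away from the mean
for _ in range(n_steps):
    x += -theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

print(x.mean(), x.var())      # stationary variance is sigma^2 / (2 * theta)
```

After a time of order $1/\theta$ the ensemble forgets its initialization: the mean relaxes to zero and the empirical variance settles near $\sigma^2/(2\theta)$, the fluctuation scale that the deterministic ODE limit cannot see.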

Information Exponent and Sample Complexity

The information exponent $k$ is a central geometric quantity controlling the sample complexity and learning dynamics. For $k \geq 2$, the majority of the data is consumed during the initial search phase, with the descent phase using a vanishing fraction of the data as $N \to \infty$. This regime is characterized by slow recovery and significant stochastic fluctuations, which are not captured by deterministic DMFT alone. The analysis demonstrates that the deterministic scaling limit fails to describe the full recovery behavior, necessitating the diffusive SDE correction.
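The information exponent itself is straightforward to compute numerically: expand the activation in probabilists' Hermite polynomials and find the first nonvanishing coefficient of order $k \geq 1$. A small sketch using Gauss–Hermite quadrature (the helper names are ours, not the paper's):

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(f, K=8, quad_points=80):
    """Coefficients c_k = E[f(Z) He_k(Z)] / k! for Z ~ N(0, 1), computed
    with Gauss-Hermite quadrature in the probabilists' convention."""
    nodes, weights = hermegauss(quad_points)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to the Gaussian law
    vals = f(nodes)
    coeffs = []
    for k in range(K + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                        # select He_k
        He_k = hermeval(nodes, basis)
        coeffs.append(np.sum(weights * vals * He_k) / math.factorial(k))
    return np.array(coeffs)

def information_exponent(f, tol=1e-8):
    """Smallest k >= 1 with a nonvanishing Hermite coefficient."""
    nonzero = np.abs(hermite_coeffs(f)) > tol
    return int(np.argmax(nonzero[1:])) + 1
```

For example, `information_exponent(np.tanh)` returns 1, while $\mathrm{He}_3(z) = z^3 - 3z$ has information exponent 3, placing it in the hard regime $k > 2$ that the paper focuses on.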

Implications and Future Directions

Theoretical Implications

  • The results establish a functional central limit theorem for SGD in high dimensions, rigorously connecting the discrete-time SGD trajectory to continuous-time ODE/SDE limits for summary statistics.
  • The emergence of Ornstein–Uhlenbeck dynamics near fixed points provides a precise characterization of stochastic fluctuations and phase transitions in the learning process.
  • The explicit dependence on the information exponent clarifies the geometric and statistical barriers to efficient learning in high dimensions.

Practical Implications

  • For practitioners, the findings indicate that in high-dimensional single-layer networks with complex activation functions, SGD may require substantially more samples to achieve non-trivial recovery, and the learning dynamics are dominated by stochastic fluctuations near critical points.
  • The analysis provides guidance for step-size selection: at the critical scaling $\delta_N = 1/N$, stochastic effects become dominant, and deterministic approximations are insufficient.
  • The results suggest that initialization strategies and activation function design can have a profound impact on sample complexity and convergence rates in high-dimensional regimes.

Future Developments

  • Extending the analysis to multi-layer networks and more general loss landscapes remains an open direction, with potential for uncovering new dynamical phase transitions and scaling laws.
  • Investigating the interplay between overparameterization, sample complexity, and stochastic fluctuations could yield further insights into the generalization properties of deep networks.
  • The framework may be adapted to study other first-order optimization algorithms and their scaling limits in high-dimensional statistical models.

Conclusion

This work provides a rigorous characterization of the scaling limits of online SGD in high-dimensional single-layer networks, revealing the limitations of deterministic DMFT and the necessity of stochastic corrections at critical step-size scaling. The explicit connection between the information exponent, sample complexity, and the emergence of Ornstein–Uhlenbeck dynamics near fixed points advances both the theoretical understanding and practical guidance for high-dimensional learning. The results lay the groundwork for future studies of stochastic optimization in complex, high-dimensional models, with implications for both theory and practice in machine learning and statistical inference.
