Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks
Published 4 Nov 2025 in stat.ML, cs.LG, math.PR, math.ST, and stat.TH | (2511.02258v1)
Abstract: This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD) for single-layer networks. Building on the seminal work of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits of SGD corresponding to the gradient flow of the population loss, we focus on the critical scaling regime of the step size. Below this critical scale, the effective dynamics are governed by ballistic (ODE) limits, but at the critical scale, a new correction term appears that changes the phase diagram. In this regime, near the fixed points, the corresponding diffusive (SDE) limits of the effective dynamics reduce to an Ornstein-Uhlenbeck process under certain conditions. These results highlight how the information exponent controls sample complexity and illustrate the limitations of deterministic scaling limits in capturing the stochastic fluctuations of high-dimensional learning dynamics.
The paper rigorously establishes scaling limits for SGD in high dimensions by linking discrete updates to deterministic ODE and stochastic SDE models.
The paper demonstrates that near critical points, stochastic fluctuations follow an Ornstein–Uhlenbeck process, highlighting the necessity of diffusion corrections.
The paper shows that the information exponent crucially influences sample complexity, guiding optimal step-size and activation function design for learning.
Introduction and Context
This paper rigorously analyzes the scaling limits of online stochastic gradient descent (SGD) in high-dimensional single-layer neural networks, focusing on the critical regime where both the sample size and parameter dimension diverge. Building on the dynamical mean-field theory (DMFT) framework and the concept of order parameters introduced by Saad and Solla, the work advances the understanding of SGD dynamics beyond deterministic (ODE) limits by characterizing the stochastic fluctuations that arise at critical step-size scaling. The analysis is performed in the teacher-student setting, with particular attention to activation functions whose Hermite expansion yields an information exponent strictly greater than two, corresponding to challenging learning regimes with quasi-linear or polynomial sample complexity.
The key regularity conditions—localizability and asymptotic closability—ensure that the summary statistics admit well-defined scaling limits. The population loss Φ(x) is assumed to have information exponent k≥2, determined by the Hermite expansion of the activation function f.
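The information exponent can be computed directly from the Hermite expansion of the activation. As a minimal sketch (the function name, tolerance, and quadrature degree are illustrative choices, not from the paper), the following estimates the probabilists' Hermite coefficients c_j = E[f(Z) He_j(Z)]/j! by Gauss-Hermite quadrature and returns the smallest j ≥ 1 with c_j ≠ 0:

```python
import math

import numpy as np
from numpy.polynomial import hermite_e as He


def information_exponent(f, max_order=8, tol=1e-8, quad_deg=80):
    """Smallest j >= 1 with a nonzero probabilists' Hermite coefficient of f."""
    # Gauss-Hermite nodes/weights for the weight exp(-x^2/2); sum(w) = sqrt(2*pi).
    x, w = He.hermegauss(quad_deg)
    w = w / np.sqrt(2 * np.pi)  # normalize so sums approximate E[.] under N(0, 1)
    fx = f(x)
    for j in range(1, max_order + 1):
        # He_j evaluated at the quadrature nodes (coefficient vector e_j).
        Hej = He.hermeval(x, np.eye(max_order + 1)[j])
        cj = np.sum(w * fx * Hej) / math.factorial(j)
        if abs(cj) > tol:
            return j
    return None


# He_2(x) = x^2 - 1 has information exponent 2; He_3(x) = x^3 - 3x has exponent 3.
```

Activations whose first nonzero Hermite coefficient sits at order k ≥ 2 (e.g. purely even functions) fall into the hard regime the paper analyzes.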
Ballistic (ODE) and Diffusive (SDE) Limits
Ballistic Regime
For step size δ_N = 1/N and activation functions with information exponent at least two, the summary statistics (m, r⊥²) converge to the solution of a deterministic ODE system:

$$\frac{dm}{dt} = -2\,\mathbb{E}\!\left[a_1\, f'(a_1 m + a_2 r_\perp)\big(f(a_1 m + a_2 r_\perp) - f(a_1)\big)\right],$$

$$\begin{aligned}
\frac{d r_\perp^2}{dt} = {}& -4\,\mathbb{E}\!\left[a_2 r_\perp\, f'(a_1 m + a_2 r_\perp)\big(f(a_1 m + a_2 r_\perp) - f(a_1)\big)\right] \\
&+ 4\,\mathbb{E}\!\left[f'(a_1 m + a_2 r_\perp)^2\Big(\big(f(a_1 m + a_2 r_\perp) - f(a_1)\big)^2 + C_\epsilon\Big)\right].
\end{aligned}$$
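As an illustrative numerical sketch (not from the paper), the ballistic ODE system can be integrated by forward Euler, with the Gaussian expectations estimated by Monte Carlo. The activation f(u) = cos(u) (which has information exponent 2), the noise level C_ε, the initial condition, and the integration step are all assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
f, fp = np.cos, lambda u: -np.sin(u)  # assumed activation; information exponent 2


def drift(m, r2, C_eps=0.1, n_mc=20_000):
    """Monte Carlo estimate of the right-hand sides of the ballistic ODEs."""
    a1, a2 = rng.standard_normal((2, n_mc))
    r = np.sqrt(max(r2, 0.0))
    pre = a1 * m + a2 * r            # student preactivation a1*m + a2*r_perp
    err = f(pre) - f(a1)             # pointwise prediction error vs the teacher
    dm = -2.0 * np.mean(a1 * fp(pre) * err)
    dr2 = (-4.0 * np.mean(a2 * r * fp(pre) * err)
           + 4.0 * np.mean(fp(pre) ** 2 * (err ** 2 + C_eps)))
    return dm, dr2


# Forward-Euler integration of the limiting ODE system.
m, r2, dt = 0.3, 0.91, 1e-2          # assumed initial condition with m^2 + r2 = 1
for _ in range(500):
    dm, dr2 = drift(m, r2)
    m, r2 = m + dt * dm, r2 + dt * dr2
```

Swapping in an activation with a different information exponent changes how quickly the overlap m moves away from its initial value.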
Random initialization from a Gaussian distribution leads to m(t) = 0 and a reduced ODE for r⊥²(t), which can be explicitly characterized using Hermite polynomial expansions and Stein's lemma.
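One way to see why m(t) ≡ 0 under random initialization (a sketch consistent with the definitions above, not a quotation of the paper's proof): at m = 0 the student preactivation a₂r⊥ is independent of a₁, so

$$\mathbb{E}\big[a_1 f'(a_2 r_\perp)\big(f(a_2 r_\perp) - f(a_1)\big)\big]
= \underbrace{\mathbb{E}[a_1]}_{=0}\,\mathbb{E}\big[f'(a_2 r_\perp) f(a_2 r_\perp)\big]
- \mathbb{E}\big[a_1 f(a_1)\big]\,\mathbb{E}\big[f'(a_2 r_\perp)\big],$$

and by Stein's lemma $\mathbb{E}[a_1 f(a_1)] = \mathbb{E}[f'(a_1)] = c_1$, the first Hermite coefficient of f, which vanishes whenever the information exponent satisfies k ≥ 2. Hence dm/dt = 0 at m = 0, and the overlap stays at zero in the ballistic limit.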
The information exponent k is a central geometric quantity controlling the sample complexity and learning dynamics. For k≥2, the majority of data is consumed during the initial search phase, with the descent phase using a vanishing fraction of data as N→∞. This regime is characterized by slow recovery and significant stochastic fluctuations, which are not captured by deterministic DMFT alone. The analysis demonstrates that the deterministic scaling limit fails to describe the full recovery behavior, necessitating the diffusive SDE correction.
Implications and Future Directions
Theoretical Implications
The results establish a functional central limit theorem for SGD in high dimensions, rigorously connecting the discrete-time SGD trajectory to continuous-time ODE/SDE limits for summary statistics.
The emergence of Ornstein–Uhlenbeck dynamics near fixed points provides a precise characterization of stochastic fluctuations and phase transitions in the learning process.
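As a hedged illustration (the parameters θ, σ, the discretization, and the burn-in are assumptions, not values from the paper), an Ornstein-Uhlenbeck process dX_t = -θX_t dt + σ dW_t can be simulated by Euler-Maruyama; after burn-in the path fluctuates around the fixed point with stationary variance σ²/(2θ):

```python
import numpy as np


def simulate_ou(theta=1.0, sigma=0.5, x0=1.0, dt=1e-3, n_steps=200_000, seed=0):
    """Euler-Maruyama discretization of dX_t = -theta * X_t dt + sigma dW_t."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    # Brownian increments over each step of length dt.
    noise = rng.standard_normal(n_steps) * np.sqrt(dt)
    for i in range(n_steps):
        x[i + 1] = x[i] - theta * x[i] * dt + sigma * noise[i]
    return x


path = simulate_ou()
# After burn-in the empirical variance approaches sigma^2 / (2 * theta) = 0.125.
stationary_var = path[50_000:].var()
```

The mean-reversion rate θ plays the role of the linearized drift at the fixed point, and σ collects the SGD noise that survives at the critical step-size scaling.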
The explicit dependence on the information exponent clarifies the geometric and statistical barriers to efficient learning in high dimensions.
Practical Implications
For practitioners, the findings indicate that in high-dimensional single-layer networks with complex activation functions, SGD may require substantially more samples to achieve non-trivial recovery, and the learning dynamics are dominated by stochastic fluctuations near critical points.
The analysis provides guidance for step-size selection: at the critical scaling δ_N = 1/N, stochastic effects become dominant, and deterministic approximations are insufficient.
The results suggest that initialization strategies and activation function design can have a profound impact on sample complexity and convergence rates in high-dimensional regimes.
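To make the step-size scaling concrete, here is a minimal online-SGD sketch for a teacher-student single-index model with δ_N = c/N, tracking the normalized overlap with the teacher. Everything here (the tanh activation, which has information exponent 1 and therefore sits in the easy regime, the dimension, and the constants) is an illustrative assumption rather than the paper's setup:

```python
import numpy as np


def online_sgd_overlap(d=400, n_steps=4000, c=1.0, seed=0):
    """Online SGD for a single-index teacher-student model at step size c/d.

    Teacher: y = f(<w_star, x>); the student w is updated one fresh sample at a
    time. Returns the trajectory of the normalized overlap m = <w, w_star>/|w|.
    """
    rng = np.random.default_rng(seed)
    f = np.tanh                                  # assumed activation (k = 1)
    fp = lambda u: 1.0 - np.tanh(u) ** 2         # its derivative
    w_star = np.zeros(d)
    w_star[0] = 1.0                              # unit-norm teacher direction
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                       # random unit initialization
    delta = c / d                                # critical scaling delta_N = c/N
    overlaps = []
    for _ in range(n_steps):
        x = rng.standard_normal(d)               # fresh Gaussian sample
        y = f(x @ w_star)
        pred = x @ w
        # One stochastic gradient step on the squared loss (f(pred) - y)^2 / 2.
        w -= delta * (f(pred) - y) * fp(pred) * x
        overlaps.append(w @ w_star / np.linalg.norm(w))
    return np.array(overlaps)
```

With k = 1 the overlap grows after O(d) samples; for activations with k ≥ 2 the same loop would stall near zero overlap for far longer, consistent with the sample-complexity barriers discussed above.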
Future Developments
Extending the analysis to multi-layer networks and more general loss landscapes remains an open direction, with potential for uncovering new dynamical phase transitions and scaling laws.
Investigating the interplay between overparameterization, sample complexity, and stochastic fluctuations could yield further insights into the generalization properties of deep networks.
The framework may be adapted to study other first-order optimization algorithms and their scaling limits in high-dimensional statistical models.
Conclusion
This work provides a rigorous characterization of the scaling limits of online SGD in high-dimensional single-layer networks, revealing the limitations of deterministic DMFT and the necessity of stochastic corrections at critical step-size scaling. The explicit connection between the information exponent, sample complexity, and the emergence of Ornstein–Uhlenbeck dynamics near fixed points advances both the theoretical understanding and practical guidance for high-dimensional learning. The results lay the groundwork for future studies of stochastic optimization in complex, high-dimensional models, with implications for both theory and practice in machine learning and statistical inference.