High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes

Published 6 Nov 2025 in stat.ML and cs.LG | (2511.03952v1)

Abstract: We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum, and it widens the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.

Summary

The paper extends scaling limit frameworks by incorporating momentum methods and adaptive step-sizes in SGD.
It uses Gaussian approximations and case studies like Spiked Tensor PCA to detail when SGD-M dynamics converge to online SGD.
Preconditioning methods such as unit normalization are shown to stabilize convergence in high-dimensional, noisy environments.

High-Dimensional Limit Theorems for SGD: Momentum and Adaptive Step-Sizes

The paper "High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes" (2511.03952) explores the high-dimensional behavior of Stochastic Gradient Descent (SGD) with Polyak Momentum (SGD-M) and adaptive step-sizes. The aim is to provide a scaling limit framework that helps compare online SGD with its popular variants.

Introduction

SGD and its variants, such as momentum and adaptive step-sizes, form the backbone of large-scale optimization in machine learning. In the high-dimensional setting, the step-size is scaled with the problem dimension as the dimension grows, and the resulting dynamics differ from the classical fixed-dimension picture. In this regime, scaling limits and limit theorems are the main tools for understanding and comparing the behavior of these algorithms.

Contributions

The paper extends previous work on the effective dynamics of SGD to accommodate momentum and adaptive step-sizes. Notably, it establishes that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. Conversely, if SGD-M is run with the same step-size as online SGD, it amplifies high-dimensional effects and may perform worse than online SGD.
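The step-size matching described above can be illustrated with a minimal sketch on a toy problem. Everything here is an assumption made for illustration: the noisy quadratic objective, the function names, and the classical heavy-ball heuristic that momentum `beta` with step `step` behaves roughly like plain SGD with step `step / (1 - beta)`. It is not the paper's scaling limit, and the exact rescaling used in the paper is not given in this summary.

```python
import numpy as np

def noisy_quadratic_grad(x, rng, noise=1.0):
    # Stochastic gradient of the toy loss 0.5 * ||x||^2 with additive Gaussian noise.
    return x + noise * rng.standard_normal(x.shape)

def online_sgd(x0, step, n_steps, rng, noise=1.0):
    # Plain online SGD: one fresh noisy gradient per step.
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * noisy_quadratic_grad(x, rng, noise)
    return x

def sgd_momentum(x0, step, beta, n_steps, rng, noise=1.0):
    # Polyak (heavy-ball) momentum: m accumulates past gradients with decay beta.
    x = x0.copy()
    m = np.zeros_like(x)
    for _ in range(n_steps):
        m = beta * m + noisy_quadratic_grad(x, rng, noise)
        x = x - step * m
    return x

def mean_square(x):
    # Average squared coordinate, used here as a proxy for the noise floor.
    return float(np.mean(x ** 2))
```

Running `sgd_momentum` with the same `step` as `online_sgd` should leave a visibly larger noise floor on this toy problem, while shrinking the momentum step to `step * (1 - beta)` should bring the two back to a comparable level. This mirrors, in a loose finite-dimensional way, the distinction between matched and unmatched step-sizes drawn above.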

Two case studies, Spiked Tensor PCA and Single Index Models, demonstrate the derived limits. In both, online SGD with an adaptive step-size based on normalized gradients further stabilizes the learning dynamics, illustrating how preconditioners can mitigate high-dimensional effects in settings where online SGD fails.

Methodology

The authors derive scaling limits and effective dynamics through Gaussian approximations of the algorithms' iterates. For SGD-M, the momentum term couples each update to the previous ones, which complicates the dynamics and requires careful adjustments at the level of the gradient estimates and the preconditioner design.

Results

For Spiked Tensor PCA, the analysis reveals conditions under which SGD-M dynamics coincide with those of online SGD. SGD-U, the variant that uses unit-normalized gradients, shows improved convergence stability, particularly in noisy regimes. This illustrates that preconditioning by normalization can broaden the range of admissible step-sizes for which the iterates converge.
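To make the difference between a plain step and a unit-normalized step concrete, here is a minimal sketch for the matrix (order-2) case. The model `Y = lam * v v^T + W` with i.i.d. standard Gaussian noise `W`, the squared Frobenius loss, and all function names are assumptions chosen for illustration. This summary does not state the paper's exact loss, normalization, or step-size scaling.

```python
import numpy as np

def sample_spiked_matrix(v, lam, rng):
    # One fresh observation: rank-one spike lam * v v^T plus i.i.d. Gaussian noise.
    n = v.shape[0]
    return lam * np.outer(v, v) + rng.standard_normal((n, n))

def loss(x, Y):
    # Assumed per-sample loss: squared Frobenius distance between Y and x x^T.
    return float(np.sum((Y - np.outer(x, x)) ** 2))

def grad(x, Y):
    # Gradient of ||Y - x x^T||_F^2 with respect to x.
    return -2.0 * (Y + Y.T) @ x + 4.0 * (x @ x) * x

def sgd_step(x, Y, step):
    # Plain online SGD step: the step length scales with the gradient norm.
    return x - step * grad(x, Y)

def sgd_u_step(x, Y, step, eps=1e-12):
    # Unit-normalized step: every update has length (almost exactly) `step`.
    g = grad(x, Y)
    return x - step * g / (np.linalg.norm(g) + eps)
```

The design point is that `sgd_u_step` moves a fixed distance regardless of how large the noisy gradient is, whereas `sgd_step` moves in proportion to the gradient norm, which grows with the dimension under this noise model.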

Similarly, for Single Index Models, the paper shows that under the adaptive scheme the dynamics admit fixed points closer to the population minimum, an advantage in practical learning scenarios where the data dimension is large.
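A single index model and one normalized-gradient update can be sketched as follows. The `tanh` link, the Gaussian covariates, the squared loss, the noise level, and every function name are hypothetical choices made for illustration. The summary does not specify which link function or loss the paper analyzes.

```python
import numpy as np

def sample_single_index(x_star, rng, noise=0.1):
    # Draw a Gaussian covariate a and a response y = tanh(<a, x_star>) + noise.
    a = rng.standard_normal(x_star.shape)
    y = np.tanh(a @ x_star) + noise * rng.standard_normal()
    return a, y

def sample_loss(x, a, y):
    # Assumed per-sample squared loss for the tanh link.
    return 0.5 * (np.tanh(a @ x) - y) ** 2

def sample_grad(x, a, y):
    # Chain rule: (prediction - y) * tanh'(<a, x>) * a, with tanh' = 1 - tanh^2.
    pred = np.tanh(a @ x)
    return (pred - y) * (1.0 - pred ** 2) * a

def normalized_step(x, a, y, step, eps=1e-12):
    # Online update with the gradient rescaled to unit norm.
    g = sample_grad(x, a, y)
    return x - step * g / (np.linalg.norm(g) + eps)

def alignment(x, x_star):
    # Cosine similarity between the iterate and the hidden direction.
    return float(x @ x_star / (np.linalg.norm(x) * np.linalg.norm(x_star)))
```

In an experiment one would call `sample_single_index` once per iteration, apply `normalized_step`, and track `alignment` over time to see where the iterates settle for a given `step`.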

Figure 1: Matrix PCA in dimension n = 10000 for lambda values showcasing different behavior for SGD-U and SGD-M. Each lambda corresponds to a particular noise regime affecting alignment.

Limitations

While the theoretical framework is rigorous, its results are limits as the dimension grows, so finite-dimensional experiments only approximate them. Predictions may also deviate from empirical observations on datasets with non-standard distributions or extreme outliers.

Future Work

Future research directions include expanding the current models to capture a wider array of problem domains, such as generative modeling and reinforcement learning, where high-dimensional scaling limits could unveil new insights. Additionally, exploring other types of adaptive preconditioners that can dynamically adjust to the data's statistical properties could provide richer scalability and robustness in training modern deep networks.

Conclusion

The paper clarifies how momentum and adaptive step-sizes behave in high-dimensional settings. By characterizing when and how SGD-M matches online SGD, and when normalized gradients improve on it, the work gives practitioners guidance on choosing step-sizes for performance and stability. The framework offers a systematic way to compare and tune SGD variants on high-dimensional problems.