- The paper establishes a DMFT framework that reduces complex high-dimensional SGD dynamics to a low-dimensional stochastic process.
- It leverages a mapping from discrete SGD to continuous-time SDEs, capturing finite-batch noise and nonlinear model effects.
- Empirical validations on logistic regression confirm convergence properties and strong alignment between predicted and actual error dynamics.
Dynamical Mean-Field Theory for Stochastic Gradient Flows in High Dimensions
Introduction and Problem Setting
The paper "High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory" (2602.06320) establishes a unified analytical framework for the high-dimensional dynamics of Stochastic Gradient Descent (SGD), particularly in the regime of multi-pass training with small batch sizes and nonlinear models. Existing theoretical approaches to SGD dynamics typically restrict attention to one-pass (online) learning, to batch sizes proportional to the sample size, or to linear models, leaving the practically relevant setting of multi-pass SGD with small batches and nonlinearities largely uncharacterized at the macroscopic level.
The authors address this gap by focusing on the continuous-time stochastic differential equation (SDE) limit of discrete-time SGD—termed the Stochastic Gradient Flow (SGF)—in the proportional high-dimensional limit where both sample size n and parameter dimensionality d diverge with fixed ratio δ=n/d. Leveraging dynamical mean-field theory (DMFT), they derive and prove the validity of a low-dimensional stochastic process that governs the empirical distributional dynamics of SGF parameters in this regime.
Theoretical Framework: Stochastic Gradient Flow and DMFT
The SGD update with learning rate η and batch size B is approximated by its continuous-time counterpart (SGF) with a "temperature" parameter τ=η/B, reflecting noise intensity due to finite stochastic mini-batches. Under suitable scaling (B=o(n), n,d→∞ with n/d→δ), the SGF increment's first and second moments are matched to those of SGD. This SDE encapsulates both the deterministic drift of the gradient flow and Gaussian noise from stochastic gradients.
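The moment-matching construction can be illustrated on a toy least-squares problem. The sketch below (my own illustration; sizes, hyperparameters, and the Euler–Maruyama scheme are illustrative choices, not taken from the paper) runs discrete multi-pass SGD alongside a discretized SGF whose noise covariance is the empirical per-sample gradient covariance scaled by τ=η/B, ignoring sampling-without-replacement corrections:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares setup: L(theta) = ||X theta - y||^2 / (2n).
d, n = 50, 100
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

def full_grad(theta):
    return X.T @ (X @ theta - y) / n

eta, B, T = 0.05, 5, 2000
tau = eta / B  # "temperature" of the SGF

# Discrete multi-pass SGD with mini-batches of size B.
theta_sgd = np.zeros(d)
for _ in range(T):
    idx = rng.choice(n, size=B, replace=False)
    theta_sgd -= eta * X[idx].T @ (X[idx] @ theta_sgd - y[idx]) / B

# Euler-Maruyama discretization of the SGF:
#   d theta = -grad L dt + sqrt(tau) Sigma(theta)^{1/2} dW,
# where Sigma is the empirical per-sample gradient covariance, so one SGD
# step (time dt = eta) has matching first and second increment moments.
theta_sgf = np.zeros(d)
dt = eta
for _ in range(T):
    g = full_grad(theta_sgf)
    per_sample = X * (X @ theta_sgf - y)[:, None]  # rows: per-sample gradients
    noise = (per_sample - g).T @ rng.standard_normal(n) / np.sqrt(n)
    theta_sgf += -dt * g + np.sqrt(tau * dt) * noise

loss = lambda th: float(np.mean((X @ th - y) ** 2) / 2)
print(loss(theta_sgd), loss(theta_sgf))  # both settle near the noise floor
```

The key line is the noise scale `sqrt(tau * dt)`: over one SGD step the SGF noise variance is τ·η·Σ = (η²/B)·Σ, matching the variance of the mini-batch gradient noise.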
The main technical achievement is the derivation of an associated DMFT: a closed system of stochastic integro-differential equations for the effective coordinate processes (parameters and predictions), together with their correlation and response functions. The macroscopic state is thus described not by d-dimensional trajectories but by a finite-dimensional stochastic process with self-consistently determined statistics, which captures the collective behavior of all coordinates.
Crucially, this DMFT generalizes and interpolates between previously isolated theories:
- In the noise-free case (τ=0), the DMFT recovers gradient-flow results as in [celentano2025highdimensionalasymptoticsfirst].
- In the infinite-data/online limit (δ→∞), matches known low-dimensional ODE/SDE characterizations for online SGD, e.g., [benarous2022highdimensionallimittheorems, goldt2019dynamicsstochasticgradient].
- For linear models, the DMFT reduces to Volterra-type equations previously derived by homogenized SGD (HSGD) approaches [paquette2025homogenizationsgdhighdimensions].
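The self-averaging behind the online-limit reduction can be seen in a toy experiment (my own sketch, not the paper's derivation): for one-pass SGD on a noiseless linear-regression teacher, the risk trajectory concentrates around a deterministic curve as d grows, which is what makes a low-dimensional ODE/SDE description possible.

```python
import numpy as np

def online_sgd_risk(d, eta, seed):
    """One-pass SGD on noiseless linear regression, a fresh Gaussian sample
    per step; returns the squared parameter error over 4*d steps."""
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)  # unit-norm teacher
    theta = np.zeros(d)
    risks = []
    for _ in range(4 * d):
        x = rng.standard_normal(d)
        resid = x @ (theta - theta_star)
        theta -= (eta / d) * resid * x  # O(1/d) steps over O(d) iterations
        risks.append(float(np.sum((theta - theta_star) ** 2)))
    return np.array(risks)

for d in (50, 500):
    finals = [online_sgd_risk(d, eta=0.5, seed=s)[-1] for s in range(8)]
    # run-to-run spread of the final risk shrinks as d grows
    print(d, np.mean(finals), np.std(finals))
```

With step size η/d and O(d) steps, the risk decays by an O(1) factor while its run-to-run fluctuations vanish with d, consistent with a deterministic macroscopic limit.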
This is formalized and proved through a combination of discretization, approximate message passing (AMP) mapping, stochastic calculus, and the use of Stein's lemma for the stochastic correction terms.
Characterization Results
Two main theorems are established. First, the existence and uniqueness of a bounded DMFT solution is proved for sufficiently short times, and globally for either vanishing SGD noise (τ=0) or linear loss settings. Second, it is shown that the joint empirical distribution of coordinate and prediction trajectories of SGF converges, in Wasserstein-2 distance, to the law of the effective DMFT process for any fixed time horizon (with convergence uniform over time discretizations).
The DMFT, in precise terms, yields a self-consistent stochastic process for effective coordinates θ_t (in R^m for m-dimensional parameter blocks), with macroscopic observables determined by correlation functions C_θ, C_ℓ and response functions R_θ, R_ℓ, each defined through pathwise averages and functional derivatives with respect to certain Gaussian processes. For nonlinear models and small batch sizes, this construction was previously an open theoretical question.
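To make the convergence metric concrete: in one dimension, the Wasserstein-2 distance between two equal-size empirical measures is computed exactly by the sorted (quantile) coupling. The toy check below (Gaussian stand-ins of my own choosing, not the actual SGF/DMFT laws) shows the distance between two samples of the same law shrinking with dimension, which is the sense in which the coordinate empirical distribution approaches the effective law.

```python
import numpy as np

def w2_empirical(a, b):
    """Exact W2 between two equal-size 1-d empirical measures
    via the quantile (sorted) coupling."""
    a, b = np.sort(a), np.sort(b)
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(1)
vals = {}
for d in (100, 10_000):
    coords = rng.normal(size=d)     # stand-in for SGF coordinates
    effective = rng.normal(size=d)  # stand-in for draws from the DMFT law
    vals[d] = w2_empirical(coords, effective)
    print(d, vals[d])               # distance shrinks as d grows
```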
Application to Special Cases and Empirical Validation
The analytical flexibility of the DMFT framework is demonstrated through several specializations:
- Infinite data/online limit: The DMFT for SGF simplifies to a low-dimensional SDE with self-averaging noise, consistent with established results for online/one-pass SGD in high dimensions.
- Planted model/teacher-student: The macroscopic DMFT system is extended to accommodate supervised learning with noisy targets generated by a ground-truth parameter, relevant for both GLMs and two-layer neural networks.
- Linear regression: The DMFT reduces to linear Volterra integral equations governing the evolution of train and test error, allowing an explicit solution via spectral integrals against the Marchenko–Pastur distribution. This exactly matches results from random matrix/HSGD approaches (see also [paquette2021sgdlargeaveragecase]).
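The spectral picture in the linear case can be checked numerically. Below is a hedged sketch (my own, with illustrative normalizations): at τ=0, each eigen-mode of the empirical covariance H = XᵀX/n contracts as e^{-λt}, so mode-averaged quantities such as (1/d)Σ_i e^{-2λ_i t}, a basic building block of the error formulas, can be evaluated either from a sampled spectrum or as an integral against the Marchenko–Pastur density.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 1000
gamma = d / n                           # MP shape parameter (here 1/delta)
X = rng.standard_normal((n, d))
lams = np.linalg.eigvalsh(X.T @ X / n)  # empirical spectrum ~ Marchenko-Pastur

# Marchenko-Pastur density on its support [lm, lp] (no atom since gamma < 1).
lm, lp = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
grid = np.linspace(lm, lp, 20001)
rho = np.sqrt(np.maximum((lp - grid) * (grid - lm), 0.0)) \
    / (2 * np.pi * gamma * grid)

def trapezoid(f, x):
    # version-proof trapezoidal rule (np.trapz was removed in NumPy 2.0)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

for t in (0.5, 2.0):
    empirical = float(np.mean(np.exp(-2.0 * lams * t)))  # (1/d) sum e^{-2 lam t}
    analytic = trapezoid(np.exp(-2.0 * grid * t) * rho, grid)
    print(t, empirical, analytic)  # the two agree as n, d grow
```

The exact train/test error formulas in the paper involve further problem-dependent weights; the point here is only that the dynamics reduce to such spectral integrals.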
Empirical simulations are conducted for multi-pass SGD on high-dimensional logistic regression, comparing the observed training and test error dynamics with those predicted by the DMFT. The agreement is strong over a range of temperature parameters, both in qualitative evolution and quantitative magnitude.
(Figure 1)
Figure 1: Train (left) and test (right) error dynamics of SGD for logistic regression with various temperature values τ=η/B. (Solid) Average errors of 10 trials of SGD with d=1024 and n=2048. Shaded regions represent one standard deviation. (Dotted) Predictions from the DMFT equation.
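For readers who want to reproduce the simulation side of Figure 1, the sketch below runs multi-pass SGD on a logistic teacher-student problem at several temperatures. Sizes and hyperparameters are reduced and illustrative (the paper uses d=1024, n=2048), and the DMFT prediction curves, which require solving the DMFT equations, are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 150, 900                                    # reduced sizes, delta = 6
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d) / np.sqrt(d)   # unit-norm teacher
p = 1.0 / (1.0 + np.exp(-X @ theta_star))
y = (rng.random(n) < p).astype(float)              # noisy logistic labels

def train_loss(theta):
    z = X @ theta
    # numerically stable mean logistic loss: log(1 + e^z) - y z
    return float(np.mean(np.log1p(np.exp(-np.abs(z)))
                         + np.maximum(z, 0.0) - y * z))

def run_sgd(eta, B, steps, record_every=10):
    theta = np.zeros(d)
    losses = []
    for k in range(steps):
        idx = rng.choice(n, size=B, replace=False)
        z = X[idx] @ theta
        theta -= eta * X[idx].T @ (1.0 / (1.0 + np.exp(-z)) - y[idx]) / B
        if k % record_every == 0:
            losses.append(train_loss(theta))
    return np.array(losses)

# Vary tau = eta / B at fixed eta by changing the batch size.
plateaus = {}
for B in (1, 8, 64):
    curve = run_sgd(eta=0.2, B=B, steps=3000)
    plateaus[B] = float(curve[-20:].mean())
    print(f"B={B}  tau={0.2 / B:.4f}  late train loss={plateaus[B]:.3f}")
```

As in the figure, larger temperature (smaller batch at fixed η) produces noisier trajectories and a higher late-time training error.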
Relation to Prior Work and Theoretical Significance
The DMFT framework rigorously connects, and often unifies, a broad family of previously disparate analyses of optimization dynamics in large models, ranging from spin-glass systems in physics and online-learning ODEs to explicit random-matrix approaches for linear estimation. It clarifies that the dimension reduction to (stochastic) scalar processes and the persistence of nontrivial noise at finite temperature both arise naturally in the proportional high-dimensional regime, with finite aspect ratio and small-batch scaling.
Concurrently and independently, [fan2026highdimensionallearningdynamics] derived a similar, but not identical, DMFT system for the SGD dynamics. Notably, the framework in (2602.06320) admits broader classes of (potentially unbounded) nonlinearities, facilitating rigorous treatment of linear regression and certain neural network models.
This theoretical foundation opens new avenues for future research:
- Analytical study of nontrivial steady states, phase transitions (such as double descent), and generalization in nonlinear SGF/SGD dynamics.
- Long-time behavior and non-stationary regimes in nonlinear high-dimensional models, paralleling recent advances for gradient flows.
- Application to understanding implicit regularization and generalization in modern deep networks under practical training regimes.
Conclusion
This work rigorously establishes that in high-dimensional, multi-pass SGD with small batches and nonlinearities, the macroscopic dynamics are precisely captured by a low-dimensional, DMFT-characterized stochastic process. This clarifies the role of stochasticity, batch size, and dimensionality in shaping solution properties, both theoretically and empirically, while unifying prior analytical approaches under a common framework. The tools and results laid out by this analysis provide a systematic language for the study and optimization of large-scale learning algorithms, offering predictive power and analytical insight for both classical estimation and modern overparameterized models (2602.06320).