Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks
Published 3 Apr 2026 in cond-mat.dis-nn, cond-mat.stat-mech, and stat.ML | (2604.03068v1)
Abstract: We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at fixed ratio $α= M/N$, and for finite hidden widths $(p,p*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p*$) only modestly accelerates escape from a plateau of poor generalization by modifying the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold the dynamics consistently selects the closest solution to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau and the solution manifold correspond to saddles with at least one negative eigenvalue and to marginal minima in the population-loss geometry, respectively.
The paper demonstrates that one-pass SGD escapes plateau regimes, with escape times primarily linked to teacher width rather than student overparameterization.
It employs an ODE formalism to capture norm learning, plateau dynamics, and convergence to a high-dimensional zero-loss manifold arising from continuous symmetries.
The study reveals that implicit bias drives the selection of the zero-loss solution closest to initialization, impacting generalization error and double-descent behavior.
Overview
This work investigates the online learning dynamics of two-layer neural networks with quadratic activations under a teacher-student protocol. It focuses on the high-dimensional regime where the input dimension N and the number of training samples M diverge at fixed ratio α=M/N, with finite student and teacher widths p and p∗, respectively. The primary contributions are an analysis of the escape from initial plateaus of poor generalization, a geometric characterization of the loss landscape, and an account of how implicit bias determines which solution one-pass stochastic gradient descent (SGD) selects.
Model Formulation
The student and teacher networks are both two-layer networks with quadratic activation functions. Inputs are drawn i.i.d. from the N-dimensional standard normal distribution, and weights are initialized to be orthonormal, with small random overlaps between student and teacher perceptrons. The loss is the squared error across samples.
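One form consistent with this description is written out below as a sketch. The 1/p and 1/p∗ averaging, the 1/√N scaling of the fields, and the factor 1/2 in the loss are assumptions made here for concreteness and are not taken from the source; only the quadratic activation, the Gaussian inputs, and the squared-error loss are.

```latex
\hat{y}(\mathbf{x}) = \frac{1}{p}\sum_{j=1}^{p}\left(\frac{\mathbf{w}_j\cdot\mathbf{x}}{\sqrt{N}}\right)^{2},
\qquad
y(\mathbf{x}) = \frac{1}{p^{*}}\sum_{k=1}^{p^{*}}\left(\frac{\mathbf{w}^{*}_k\cdot\mathbf{x}}{\sqrt{N}}\right)^{2},
\qquad
\mathbf{x}\sim\mathcal{N}(0, I_N),
\qquad
\varepsilon(\mathbf{x}) = \tfrac{1}{2}\bigl(\hat{y}(\mathbf{x}) - y(\mathbf{x})\bigr)^{2}.
```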
The dynamics considered is strictly online: one-pass SGD where each training sample is used exactly once, and weight updates are indexed by the sample index.
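The one-pass protocol can be illustrated with a small NumPy simulation. This is a minimal sketch, not the authors' code: the model normalization (average of squared fields, fields scaled by 1/√N), the learning rate, the Gaussian initialization, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def preactivations(W, X):
    """Hidden-unit fields W x / sqrt(N) for a batch X of shape (n, N)."""
    return X @ W.T / np.sqrt(W.shape[1])

def quad_net(W, X):
    """Assumed model: average of the squared fields over the hidden units."""
    return np.mean(preactivations(W, X) ** 2, axis=1)

def one_pass_sgd(W0, W_teacher, n_samples, lr, rng):
    """One-pass SGD: each step draws a fresh Gaussian input and never reuses it."""
    W = W0.copy()
    p, N = W.shape
    for _ in range(n_samples):
        x = rng.standard_normal((1, N))
        h = preactivations(W, x)[0]                      # shape (p,)
        err = np.mean(h ** 2) - quad_net(W_teacher, x)[0]
        # Gradient of 0.5 * err**2 with respect to W for the assumed model.
        W -= lr * err * (2.0 / p) * np.outer(h, x[0]) / np.sqrt(N)
    return W

def generalization_error(W, W_teacher, rng, n_test=5000):
    """Monte Carlo estimate of 0.5 * E[(student - teacher)^2] on fresh inputs."""
    X = rng.standard_normal((n_test, W.shape[1]))
    return 0.5 * np.mean((quad_net(W, X) - quad_net(W_teacher, X)) ** 2)

rng = np.random.default_rng(0)
N, p, p_star = 50, 3, 2
W_teacher = rng.standard_normal((p_star, N))
W0 = 0.1 * rng.standard_normal((p, N))                   # small initial norms
W = one_pass_sgd(W0, W_teacher, n_samples=20 * N, lr=0.1, rng=rng)  # alpha = 20
print(generalization_error(W0, W_teacher, rng),
      generalization_error(W, W_teacher, rng))
```

The number of SGD steps equals the number of samples, so in this sketch the ratio α=M/N plays the role of training time.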
Dynamical Analysis: ODE Formalism
In the limit N,M→∞ with α=M/N finite, population-level behavior is captured by closed ODEs governing the evolution of the teacher-student overlap matrix ρ and the student-student overlap matrix. Analytical and numerical integration of these ODEs separates the learning process into distinct regimes:
Norm Learning: In the initial phase, student nodes' norms equilibrate rapidly to a fixed point, independent of overlap with the teacher.
Plateau Regime: After norm stabilization, overlaps remain small and the generalization error stops decreasing: the system is trapped in a flat, saddle-like region of the landscape.
Escape Dynamics: Overparameterization (p > p∗) only weakly accelerates escape from the plateau, affecting the prefactor of the exponential decay of the loss, but not the intrinsic timescale, which is dominated by p∗ (the teacher's width).
Convergence to Zero-Loss Manifold: For sufficiently large sample sizes, the network enters a regime with zero generalization error. Unlike single-index models, for p > 1 the set of zero-loss solutions forms a high-dimensional continuous manifold, associated with the model's internal symmetries.
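The overlap matrices that the ODEs track, and the fact that the population loss depends on the weights only through them, can be made concrete as follows. This is a sketch under assumptions: the model is taken to be the average of squared fields with 1/√N scaling, the overlaps are normalized by N, and the names Q and T for the student-student and teacher-teacher overlaps are chosen here, not taken from the source. The closed form uses the Gaussian identity E[(xᵀDx)²] = (tr D)² + 2 tr(D²) for symmetric D.

```python
import numpy as np

def overlaps(W, W_teacher):
    """Teacher-student (rho), student-student (Q) and teacher-teacher (T) overlaps."""
    N = W.shape[1]
    rho = W @ W_teacher.T / N
    Q = W @ W.T / N
    T = W_teacher @ W_teacher.T / N
    return rho, Q, T

def population_loss(rho, Q, T):
    """0.5 * E[(student - teacher)^2] over Gaussian inputs, from overlaps only."""
    p, p_star = rho.shape
    trace_gap = np.trace(Q) / p - np.trace(T) / p_star
    frob = (np.sum(Q ** 2) / p ** 2
            - 2.0 * np.sum(rho ** 2) / (p * p_star)
            + np.sum(T ** 2) / p_star ** 2)
    return 0.5 * (trace_gap ** 2 + 2.0 * frob)

rng = np.random.default_rng(0)
N, p, p_star = 40, 3, 2
W_teacher = rng.standard_normal((p_star, N))
W = rng.standard_normal((p, N))

# Rotating the student's hidden units leaves the loss unchanged.
O, _ = np.linalg.qr(rng.standard_normal((p, p)))
print(population_loss(*overlaps(W, W_teacher)),
      population_loss(*overlaps(O @ W, W_teacher)))
```

In this assumed model the two printed values agree up to rounding, which is the continuous rotational symmetry behind the zero-loss manifold.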
Loss Landscape Geometry
Landscape analysis via the population loss function reveals:
Critical Points: Several classes exist: trivial (all-zero weights), uncorrelated student-teacher configurations (plateau), and perfectly correlating configurations (global minima).
At the plateau, the Hessian exhibits a large fraction of null eigenvalues and at least one negative direction, confirming the presence of extended flat saddle regions that hinder learning progress.
At zero-loss solutions, the Hessian develops a high-dimensional kernel of marginal directions, whose number grows combinatorially with p and p∗. These directions are consequences of the continuous rotational symmetry that the quadratic model has for p > 1.
Additional null directions specific to p > p∗ (overparameterized student) are present that are not generated by symmetry transformations, reflecting genuine proliferation of flat directions as overparameterization increases.
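The flat directions at a zero-loss solution can be counted numerically in a small example. The sketch below assumes the model output depends on the weights only through WᵀW (as it does for an average of squared fields). The population loss is then a positive-definite quadratic form in the difference between student and teacher matrices, so at a zero-loss point the Hessian's null space is the set of perturbations dW that leave WᵀW unchanged to first order. The function name and the test configurations are hypothetical.

```python
import numpy as np

def flat_directions(W):
    """Number of perturbations dW with dW^T W + W^T dW = 0 (Hessian null space
    at a zero-loss point of the assumed model)."""
    p, N = W.shape
    columns = []
    for j in range(p):
        for a in range(N):
            E = np.zeros((p, N))
            E[j, a] = 1.0                        # basis perturbation of weight (j, a)
            columns.append((E.T @ W + W.T @ E).ravel())
    J = np.stack(columns, axis=1)                # linearized map dW -> d(W^T W)
    return p * N - np.linalg.matrix_rank(J)

rng = np.random.default_rng(0)
N, p_star = 6, 2
W_teacher = rng.standard_normal((p_star, N))

# Matched widths: the student equals the teacher.
n_matched = flat_directions(W_teacher)

# Overparameterized student (p = 3) with W^T W / p equal to the teacher's.
p = 3
U, _ = np.linalg.qr(rng.standard_normal((p, p_star)))    # orthonormal columns
W_over = np.sqrt(p / p_star) * U @ W_teacher
n_over = flat_directions(W_over)
print(n_matched, n_over)
```

For this assumed model the matched case has p(p−1)/2 = 1 flat direction, the rotation generator. In the overparameterized case the rotation orbit is three-dimensional, and the remaining flat directions are not generated by rotations.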
Implicit Bias: Solution Selection
Despite the high degeneracy of zero-loss solutions, online SGD implicitly selects the element of the manifold closest to initialization in Euclidean weight space. This selection is enforced by matrix-valued conservation laws within the ODEs, consistent with Noether-type results for learning dynamics with continuous symmetries. Solution selection thus arises purely from initialization and symmetry constraints, not from explicit regularization or dataset-dependent noise.
This mechanism is demonstrably robust: trajectories initialized with different random overlaps consistently converge to the closest compatible zero-loss solution, with analytic expressions predicting both the final configuration and the conservation principle along the dynamics.
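The notion of "closest zero-loss solution" can be illustrated for matched widths (p = p∗). This sketch assumes the zero-loss set consists of the rotated and reflected teachers O W∗ with O orthogonal. Finding the nearest such point to an initialization W0 is then an orthogonal Procrustes problem, solved by the polar factor of W0 W∗ᵀ, which is proportional to the initial teacher-student overlap. The source does not give its analytic expressions, so this is only the geometric statement, not the paper's formula, and the function name is hypothetical.

```python
import numpy as np

def closest_zero_loss_solution(W0, W_teacher):
    """Nearest point to W0 (Frobenius norm) on the set {O @ W_teacher : O orthogonal}."""
    M = W0 @ W_teacher.T                  # proportional to the initial overlap
    U, _, Vt = np.linalg.svd(M)
    O = U @ Vt                            # orthogonal polar factor of M
    return O @ W_teacher, O

rng = np.random.default_rng(0)
N, p = 30, 3
W_teacher = rng.standard_normal((p, N))
W0 = rng.standard_normal((p, N))

W_sel, O_sel = closest_zero_loss_solution(W0, W_teacher)
d_sel = np.linalg.norm(W_sel - W0)

# Compare against a few other points on the same zero-loss set.
d_others = []
for _ in range(100):
    O_rand, _ = np.linalg.qr(rng.standard_normal((p, p)))
    d_others.append(np.linalg.norm(O_rand @ W_teacher - W0))
print(d_sel, min(d_others))
```

The selected point implements the same function as the teacher, because WᵀW is unchanged by O, but its individual hidden units are in general rotated mixtures of the teacher's.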
Implications and Directions
Theoretical Significance
Limited Power of Overparameterization: In sharp contrast to linear and deep ReLU networks where more parameters can dramatically accelerate learning and improve generalization, here, increasing p primarily reduces escape time by a multiplicative factor without impacting the underlying timescale, which stays dominated by the effective number of teacher features p∗. The plateau phenomenon thus arises from landscape geometry rather than capacity limitations.
Symmetry-Induced Marginal Stability: The existence of an extensive marginal manifold for zero-loss solutions, even at the interpolation threshold p = p∗, is a direct consequence of continuous symmetries, not of parameter redundancy. Overparameterization further enhances the number of flat directions.
Implicit Bias and Self-Averaging: In the strongly overparameterized regime, selection among zero-loss solutions is governed by initial condition variance, which is a principal contributor to generalization error near the interpolation point. Incremental increase in p self-averages this initialization dependence, providing an analytic foundation for the double-descent phenomenon.
Practical Implications
Solution Diversity and Initialization Sensitivity: Unconstrained quadratic models with p > 1 do not recover teacher parameters uniquely; instead, the initialization determines which of the functionally equivalent solutions is reached, and training routines should account for this.
Loss Geometry Engineering: Flat regions and the lack of strong convexity at global minima are structural features of the model; early stopping or additional regularization may be necessary to control implicit bias in finite-data regimes or to avoid overfitting to particular initialization paths.
Generalization Error: In online (population-risk minimizing) settings, the model provides a clean test-bed for understanding how generalization error decomposes into data and initialization-induced variability.
Future Directions
Extensions to finite-sample settings and finite-batch SGD are suggested as natural directions, aiming to connect these findings to empirically observable double-descent phenomena. The impact of symmetry-induced ambiguity on specialization transitions and the broader implications for loss landscape geometry in deep and structured models are also highlighted as open problems.
Conclusion
The analysis demonstrates that for two-layer quadratic neural networks, escape from plateaus of poor generalization under one-pass SGD is marginally influenced by overparameterization, while the symmetry-induced degeneracy in the solution space fundamentally shapes both training dynamics and implicit bias. Marginal stability and solution selection by minimal distance to initialization are established as generic outcomes, with significant implications for the geometry of the loss landscape and the design of scalable, interpretable learning protocols in wider classes of symmetric overparameterized models.
Reference:
Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks (2604.03068)