Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models
Published 6 Nov 2025 in cs.LG, math.OC, and stat.ML | (2511.03972v2)
Abstract: An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.
The paper establishes non-asymptotic convergence guarantees for the stochastic Gauss-Newton method, offering finite-time bounds that do not rely on strict positive definiteness of the neural tangent kernel.
It derives generalization bounds via algorithmic stability, showing improved performance when the Gauss-Newton matrix has a large minimum eigenvalue along the optimization path.
The analysis employs a novel variable-metric approach with a time-varying Lyapunov function to capture the stochastic and path-dependent dynamics in deep networks.
Introduction
How higher-order optimization methods, and Gauss-Newton methods in particular, affect the generalization of deep learning models is an important question that remains poorly understood. The stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping is a promising optimization framework for training overparameterized deep neural networks. This paper provides non-asymptotic, finite-time optimization and generalization bounds for this method, elucidating its behavior in overparameterized regimes.
Main Contributions
Non-Asymptotic Convergence Guarantees
The paper establishes that the Stochastic Gauss-Newton method converges at a non-asymptotic rate of:
$\mathcal{O}\!\left(\frac{1}{k}\left[\bar{r}_k \log k + \lambda + \lambda^{-1}\right] + \frac{1}{m}\right),$
where k is the iteration count, B the batch size, m the network width, and λ the damping factor. This result is significant because it does not rely on the strict positive definiteness of the neural tangent kernel, indicating robust performance even in ill-conditioned scenarios and avoiding dependence on the sample size n.
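To make the setting concrete, below is a minimal NumPy sketch of one damped stochastic Gauss-Newton step on a mini-batch. The function names, residual interface, and step size are illustrative assumptions rather than the paper's exact algorithm; only the structure of the update, a preconditioner of the form (1/B) JᵀJ + λI applied to the mini-batch gradient, reflects the method being analyzed.

```python
import numpy as np

def sgn_lm_step(theta, jacobian_fn, residual_fn, batch, lam=1e-3, lr=1.0):
    """One stochastic Gauss-Newton / Levenberg-Marquardt step on a mini-batch.

    theta        : current parameter vector, shape (p,)
    jacobian_fn  : returns the mini-batch Jacobian of the residuals w.r.t. theta, shape (B, p)
    residual_fn  : returns the mini-batch residuals r = f(x; theta) - y, shape (B,)
    lam          : Levenberg-Marquardt damping factor (the lambda in the rate above)
    lr           : step size
    """
    J = jacobian_fn(theta, batch)              # (B, p) mini-batch Jacobian
    r = residual_fn(theta, batch)              # (B,)  mini-batch residuals
    B = len(r)
    # Damped mini-batch Gauss-Newton matrix: (1/B) J^T J + lam * I
    G = J.T @ J / B + lam * np.eye(theta.size)
    # Preconditioned gradient direction; solve rather than invert for numerical stability
    direction = np.linalg.solve(G, J.T @ r / B)
    return theta - lr * direction
```

The damping factor lam keeps the linear solve well-posed even when JᵀJ is ill-conditioned; note that both λ and λ⁻¹ appear in the rate above, reflecting the usual trade-off in choosing the damping level.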
Stability-Based Generalization Bounds
Generalization bounds are established through algorithmic stability, revealing a favorable regime for Stochastic Gauss-Newton when the Gauss-Newton matrix displays a large minimum eigenvalue along the optimization path. The bounds quantify the impact of model complexity, batch size, and curvature, aligning with empirical observations that larger models in overparameterized settings can improve generalization.
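For reference, these results follow the standard uniform-stability template; the display below shows the generic form of such a bound (in the Bousquet-Elisseeff style), not the paper's exact constants, which additionally depend on curvature, batch size, and width.

```latex
% Generic uniform-stability bound (schematic, not the paper's exact constants):
% if the randomized algorithm A is \varepsilon_n-uniformly stable on samples
% of size n, then its expected generalization gap is controlled by \varepsilon_n:
\mathbb{E}\left[ R\big(A(S)\big) - \widehat{R}_S\big(A(S)\big) \right] \;\le\; \varepsilon_n .
```

In the favorable regime identified by the paper, a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields a smaller stability parameter, and hence a tighter gap of this form.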
Technical Analysis
The paper introduces a novel variable-metric analysis, distinguishing it from traditional function space analyses. This approach captures the stochastic and path-dependent nature of the Gauss-Newton method, leveraging a time-varying Lyapunov function to establish bounded optimization paths even in the wide-network regime. The stochastic Gauss-Newton preconditioner, which adapts through mini-batch sampling, plays a key role in maintaining algorithmic stability and achieving effective generalization.
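As a rough illustration of the variable-metric idea, one can think of tracking the iterates in the norm induced by the damped Gauss-Newton matrix at the current step; the display below is a schematic Lyapunov function of that kind, assumed here for illustration rather than taken from the paper.

```latex
% Schematic variable-metric Lyapunov function (illustrative assumption,
% not the paper's exact construction): measure progress in the norm
% induced by the damped mini-batch Gauss-Newton matrix at step k,
V_k \;=\; \left\| \theta_k - \theta^{\star} \right\|_{M_k}^{2},
\qquad
M_k \;=\; \frac{1}{B} J_k^{\top} J_k + \lambda I .
% Because M_k changes with k, the analysis must control both the per-step
% contraction and the drift of the metric itself, which is where the
% time-varying aspect enters.
```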
Implications and Future Directions
This work provides theoretical foundations for the convergence and generalization of Gauss-Newton methods in overparameterized deep learning, and suggests that the analysis may carry over to other second-order methods. Future research could extend these insights to more general preconditioned optimization techniques or explore their application to different neural architectures. By identifying regimes where higher-order methods enhance generalization, this research informs the practical choice of optimization strategy in overparameterized learning scenarios.
Conclusion
The Stochastic Gauss-Newton method, with its robust convergence and favorable generalization properties, represents a promising direction for training deep neural networks. This paper's contributions offer a deeper theoretical understanding of these methods, with practical implications for their application in complex learning tasks. The developed bounds and insights guide effective utilization of Gauss-Newton methods in scenarios characterized by high model capacity and intricate curvature dynamics.