Non-Convex Optimization: Asymptotic Convergence
- Asymptotic convergence in non-convex settings is defined by how iterative methods approach stationary points using stochastic, momentum, and adaptive techniques despite the absence of convexity.
- It examines convergence phenomena including almost-sure and mean-square convergence with quantified rates in structured landscapes and under various hyperparameter schemes.
- The framework relaxes classical assumptions, validating practical approaches like batch size scheduling and adaptive step-size control in deep learning applications.
Asymptotic convergence in non-convex optimization settings describes the theoretical properties ensuring that iterates generated by algorithms (e.g., SGD, adaptive gradient methods, or momentum-based schemes) approach stationary points or minimizers of the objective, despite the absence of convexity and, often, smoothness. This regime encompasses phenomena such as almost-sure convergence, mean-square convergence, and rates characterizing proximity to critical points under diverse assumptions on the objective, noise, and algorithmic hyperparameters. Modern research quantifies these properties for stochastic and adaptive methods, handling both structure-rich landscapes (sector-bounded, KL geometry) and algorithmic variants (constant step, increasing batch size, projection). Below is a rigorous account of the principal theoretical findings and frameworks in asymptotic convergence for non-convex settings.
1. Function Classes, Geometric and Subdifferential Notions
Several non-convex function classes admit global asymptotic convergence analysis:
- Sector-bounded gradient class: Functions whose gradient satisfies a sector condition about the minimizer x*, i.e., there exist constants 0 < m ≤ L such that m‖x − x*‖² ≤ (x − x*)ᵀ∇f(x) and ‖∇f(x)‖ ≤ L‖x − x*‖ for all x. This class strictly contains strongly convex functions with Lipschitz-continuous gradients, thus permitting non-convex behavior away from the minimizer (Ugrinovskii et al., 2022).
- Sector-wise subdifferentials and stationarity measures: For constrained problems, stationarity is characterized not merely by the gradient norm, but by the distance to generalized objects such as the Goldstein subdifferential, which captures constraint-induced criticality and generalizes Moreau-envelope criteria (Zheng et al., 3 Oct 2025).
- Composite and differential inclusion models: Objective classes like composite functions with inner functions as epi-limits of difference-of-convex functions extend amenable function theory to non-Lipschitz, non-smooth problems. Stationarity is formalized via A-stationarity, using asymptotic subdifferential constructions (Li et al., 2023). For locally Lipschitz objectives, the Clarke subdifferential governs criticality and differential inclusion trajectory limits (Bianchi et al., 2020).
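As a concrete illustration of the sector-bounded class, the following sketch checks the sector bound numerically for a hypothetical 1-D function of my own choosing (not an example from the cited paper) that is non-convex, since its second derivative dips below zero, yet still sector-bounded about its unique minimizer x* = 0:

```python
import numpy as np

# Hypothetical example: f(x) = x^2 + 1.5*sin^2(x) is non-convex (its
# curvature reaches -1 at x = pi/2) but its gradient satisfies the sector
# bound m*(x - x*)^2 <= (x - x*)*f'(x) <= L*(x - x*)^2 with x* = 0.
f = lambda x: x**2 + 1.5 * np.sin(x)**2
grad = lambda x: 2 * x + 1.5 * np.sin(2 * x)
hess = lambda x: 2 + 3 * np.cos(2 * x)

xs = np.linspace(-10, 10, 100001)
xs = xs[np.abs(xs) > 1e-6]            # avoid 0/0 at the minimizer itself
ratio = xs * grad(xs) / xs**2         # sector ratio (x - x*) f'(x) / (x - x*)^2

m_hat, L_hat = ratio.min(), ratio.max()
print(f"sector bounds on the grid: m ~ {m_hat:.3f}, L ~ {L_hat:.3f}")
print(f"min curvature on the grid: {hess(xs).min():.3f}  (negative => non-convex)")
```

The grid estimates recover positive sector constants even though the function has regions of negative curvature, which is precisely the separation between sector-boundedness and convexity.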
2. Stochastic Gradient and Momentum Methodologies
Algorithmic frameworks exhibit distinct asymptotic behaviors depending on the employed update scheme and parameterization:
- Heavy-Ball (HB) and Triple-Momentum: The HB update x_{k+1} = x_k − α∇f(x_k) + β(x_k − x_{k−1}), with step size α and momentum β chosen in a precisely characterized stability region, is globally convergent for all sector-bounded objectives. The asymptotic rate is quantified exactly by minimizing a closed-form convergence factor over the admissible parameters, and it outperforms the Triple-Momentum Method over a range of condition numbers (Ugrinovskii et al., 2022).
- Quasi-Hyperbolic Momentum (QHM) and Batch Size Scheduling: For general smooth non-convex objectives, QHM iterates with mini-batch stochasticity exhibit asymptotic convergence of the expected gradient norm to zero provided either the learning rate decays or the batch size grows so that the mini-batch gradient noise vanishes sufficiently fast. Growing the batch size while holding the learning rate constant is preferable for non-asymptotic rates (Imaizumi et al., 30 Jun 2025).
- SGD under relaxed step-size schemes: For unconstrained non-convex problems, SGD with step sizes that are non-increasing, divergent in sum, and polynomially decaying yields almost-sure and mean-square convergence of the gradient norm to zero, even in the absence of global Lipschitz continuity or strong boundedness of higher moments. A stopping-time technique supersedes the classical Robbins-Monro requirements (Jin et al., 17 Apr 2025, Fest et al., 2023).
- Adaptive algorithms (Adagrad, RMSProp, Adam): In the single time-scale regime, algorithms adapting step sizes via per-coordinate accumulators or exponential smoothing (Adagrad, RMSProp) achieve almost-sure convergence to critical points under mild smoothness, noise, and stability conditions; the ODE method and Lyapunov descent are the central proof devices. For Adam, a precisely tuned constant step size guarantees stationarity at an explicit sublinear rate under minimal conditions (Gadat et al., 2020, Mazumder et al., 2023, Jin et al., 2024).
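The interplay of momentum and batch-size growth can be sketched as follows. The update x ← x − α[(1 − ν)g + νv] is the standard QHM form; the toy objective, the constants (lr, beta, nu), and the doubling schedule are illustrative assumptions of mine, not settings prescribed by the cited analyses:

```python
import numpy as np

# Sketch: QHM with an exponentially growing batch size and a constant
# learning rate on a noisy 1-D non-convex toy problem. Averaging a
# mini-batch of size b shrinks the gradient-noise std like 1/sqrt(b).
rng = np.random.default_rng(0)
grad = lambda x: 2 * x + 1.5 * np.sin(2 * x)   # f(x) = x^2 + 1.5*sin^2(x)

def noisy_grad(x, batch):
    return grad(x) + rng.normal(0.0, 1.0) / np.sqrt(batch)

x, v = 3.0, 0.0
lr, beta, nu = 0.1, 0.9, 0.7                   # illustrative constants
for t in range(200):
    batch = int(2 ** (t / 20))                 # batch size doubles every 20 steps
    g = noisy_grad(x, batch)
    v = beta * v + (1 - beta) * g              # momentum buffer
    x -= lr * ((1 - nu) * g + nu * v)          # QHM step: blend of SGD and momentum

print(f"final iterate {x:.4f}, true gradient norm {abs(grad(x)):.4f}")
```

Because the noise floor shrinks as the batch grows, the iterate settles near the unique stationary point without any learning-rate decay, mirroring the qualitative message of the batch-scheduling result.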
3. Precise Convergence Criteria and Rates
Asymptotic convergence in non-convex settings is quantified by:
- Stationarity via gradient norm or subdifferential distance:
- Unconstrained: ‖∇f(x_k)‖ → 0 (almost surely, in expectation, and/or in probability) (Jin et al., 17 Apr 2025, Gadat et al., 2020).
- Constrained/Projected: dist(0, ∂_δ f(x_k)) → 0, i.e., vanishing distance to the Goldstein subdifferential (Zheng et al., 3 Oct 2025).
- Convergence rates:
- Momentum and adaptive methods under decaying learning rates or growing batch sizes attain quantified sublinear rates in expectation, typically stated for the minimal gradient norm over the first K iterations (Imaizumi et al., 30 Jun 2025, Gadat et al., 2020, Jin et al., 2024).
- Under sector-boundedness or KL geometry, accelerated methods (HB, Polyak) and stochastic KL descent attain sublinear to linear rates dependent on structural exponents and condition numbers (Ugrinovskii et al., 2022, Fest et al., 2023).
- Differential inclusion and invariant measures: With stochasticity and nonsmoothness, small-step-size constant SGD iterates concentrate around Clarke-critical sets; their time-interpolated paths converge in probability to solutions of the differential inclusion ẋ(t) ∈ −∂f(x(t)), where ∂f denotes the Clarke subdifferential, and stationary distributions of the associated Markov chain converge weakly to invariant distributions of the inclusion (Bianchi et al., 2020).
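A minimal sketch of the unconstrained criterion: track the running minimum of the true gradient norm, the quantity the asymptotic guarantees control, along an SGD trajectory with a polynomially decaying step size. All constants (exponent, scale, noise level) are illustrative choices:

```python
import numpy as np

# SGD on a 1-D non-convex toy objective with unit-variance gradient noise
# and a divergent-sum, vanishing step-size schedule. We monitor
# min_k ||grad f(x_k)||, which the convergence results drive to zero.
rng = np.random.default_rng(1)
grad = lambda x: 2 * x + 1.5 * np.sin(2 * x)   # f(x) = x^2 + 1.5*sin^2(x)

x = 3.0
best = abs(grad(x))
for k in range(1, 5001):
    alpha = 0.5 / k**0.6                       # sum(alpha) diverges, alpha -> 0
    x -= alpha * (grad(x) + rng.normal())      # noisy gradient step
    best = min(best, abs(grad(x)))             # running min of the true grad norm

print(f"min gradient norm over 5000 iterations: {best:.4f}")
```

Note that it is the running minimum, not the last iterate, that the standard non-asymptotic statements bound; the last iterate continues to fluctuate at the scale set by the current step size.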
4. Relaxing Classical Structural Assumptions
Recent convergence analyses have significantly relaxed assumptions necessary for asymptotic results:
- Step-size control: Rather than requiring strict square-summability, any non-increasing step-size sequence with divergent sum and suitable polynomial decay suffices; this includes practical schedules of the form α_k ∝ 1/k^p (Jin et al., 17 Apr 2025).
- Noise and smoothness: Convergence results are achieved with only local p-th moment control of the noise near criticality, without global boundedness, and often under even weaker conditions such as affine variance or weak growth; global Lipschitz continuity is inessential if local regularity and coercivity are enforced (Jin et al., 2024, Gadat et al., 2020).
- Stochastic processes and mixing: Both IID and mixing dependent data (with suitably decaying mixing coefficients) permit the same asymptotic guarantees for projected SGD with non-convex losses (Zheng et al., 3 Oct 2025).
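The divergent-sum versus square-summable distinction behind the relaxed step-size conditions can be illustrated numerically; the exponents below are arbitrary sample values, not thresholds from the cited analysis:

```python
import numpy as np

# Partial sums of alpha_k = 1/k^p and alpha_k^2 for several exponents p.
# Classical Robbins-Monro asks for sum(alpha) = inf and sum(alpha^2) < inf;
# the relaxed analyses keep the divergent sum while weakening the second
# requirement. Smaller p makes both partial sums grow faster.
k = np.arange(1, 10**6 + 1, dtype=float)
for p in (0.4, 0.6, 1.0):
    a = 1.0 / k**p
    print(f"p={p}: sum(alpha) = {a.sum():.1f}, sum(alpha^2) = {(a**2).sum():.2f}")
```

For p = 1.0 the squared sum stays bounded (it converges to π²/6), whereas for p = 0.4 it keeps growing, i.e., the schedule violates square-summability while retaining a divergent sum.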
5. Constrained and Composite Non-Convex Frameworks
Algorithms for constrained or composite objectives utilize refined subdifferential and stationarity constructs:
| Framework | Stationarity Criterion | Applicable Algorithms |
|---|---|---|
| Sector-bounded | Convergence to the global minimizer | HB, TMM (Ugrinovskii et al., 2022) |
| Constrained (Goldstein) | Distance to the Goldstein subdifferential | Projected SGD (Zheng et al., 3 Oct 2025) |
| Composite ADC problems | A-stationarity: asymptotic subdifferential inclusion | Prox-ADC, DC methods (Li et al., 2023) |
| Non-smooth, locally Lipschitz | Clarke subdifferential criticality | Constant-step SGD (Bianchi et al., 2020) |
Asymptotic convergence can be rigorously established for accumulation points of iterates generated by double-loop prox-linear majorization or time-rescaled stochastic processes; stationarity is met in the sense appropriate to the nonsmooth, composite, or constrained landscape.
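A sketch of projected stochastic gradient descent on a toy constrained non-convex problem. As the stationarity measure it uses the projected-gradient residual, a simple proxy for the subdifferential-distance criteria above rather than the Goldstein construction itself; the interval constraint and all constants are illustrative:

```python
import numpy as np

# Minimize f(x) = x^2 + 1.5*sin^2(x) over [1, 4]. The unconstrained
# minimizer x* = 0 is infeasible, so iterates should settle at the boundary
# x = 1, where the raw gradient norm stays bounded away from zero but the
# projected-gradient residual vanishes.
rng = np.random.default_rng(2)
grad = lambda x: 2 * x + 1.5 * np.sin(2 * x)
proj = lambda x: np.clip(x, 1.0, 4.0)

x = 3.5
for k in range(1, 3001):
    alpha = 0.5 / k**0.6
    x = proj(x - alpha * (grad(x) + rng.normal()))   # projected noisy step

# Constrained stationarity measure: ||x - proj(x - grad f(x))||,
# not ||grad f(x)||.
residual = abs(x - proj(x - grad(x)))
print(f"x = {x:.3f}, raw |grad| = {abs(grad(x)):.3f}, projected residual = {residual:.4f}")
```

The run makes the point in the table concrete: at the constrained stationary point the gradient norm alone is the wrong criterion, while the projection-aware residual correctly reports stationarity.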
6. Implications for Practical Deep Learning and Sampling
The theoretical advancements have direct impact on algorithm design and tuning strategies for non-convex landscapes:
- Batch size scheduling: Exponentially growing batch sizes without decaying learning rates are validated as preserving fast convergence and stationarity in deep neural network optimization (Imaizumi et al., 30 Jun 2025).
- Adaptive step-scheduling: AdaGrad and RMSProp with vanishing coordinate-wise steps achieve almost-sure and mean-square convergence without uniformly bounded gradients, matching empirical practice (Gadat et al., 2020, Jin et al., 2024).
- Sampling via Langevin schemes: The modified Tamed Unadjusted Langevin Algorithm (mTULA) provides non-asymptotic bounds in Wasserstein-1 and Wasserstein-2 distance, polynomial in the step size λ, for non-convex potentials without global convexity, with asymptotic exactness as λ → 0 (Neufeld et al., 2022).
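The following is a hedged sketch of a tamed Langevin step on a double-well potential. The taming factor used here, g/(1 + √λ·|g|), is a generic device in the spirit of tamed schemes; the exact mTULA taming in the cited work differs in its details, and the potential and constants are toy choices:

```python
import numpy as np

# Tamed unadjusted Langevin sampling from pi(x) ~ exp(-U(x)) with the
# non-convex double-well potential U(x) = x^4/4 - x^2/2 (minima at +/-1).
# The taming factor keeps the superlinearly growing drift x^3 stable at
# a fixed step size lam; the discretization bias vanishes as lam -> 0.
rng = np.random.default_rng(3)
grad_U = lambda x: x**3 - x

lam = 0.01                                  # step size
x = np.full(20000, 2.0)                     # 20k parallel chains, all started at 2
for _ in range(2000):
    g = grad_U(x)
    tamed = g / (1.0 + np.sqrt(lam) * np.abs(g))
    x = x - lam * tamed + np.sqrt(2 * lam) * rng.normal(size=x.size)

# After mixing, mass should spread over both wells at x = -1 and x = +1.
print(f"sample mean {x.mean():.3f}, mean |x| {np.abs(x).mean():.3f}")
```

The low barrier between the wells lets the chains cross repeatedly, so the empirical mean relaxes toward zero while mean |x| concentrates near the well locations, consistent with sampling a symmetric bimodal target.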
These methods enable improved theoretical justification for algorithms broadly adopted in deep learning and large-scale optimization problems, bridging the gap between robust theoretical convergence and settings encountered in high-dimensional, non-convex empirical regimes.
7. Future Directions and Open Challenges
Advancing asymptotic convergence theory in non-convex settings involves:
- Framework generalization: Extension of stopping-time, KL, and conditional-descent techniques to more complex adaptive and momentum optimizers (e.g., Adam, RMSProp with fixed or adaptive smoothing).
- Invariant measure characterization: Deeper exploration of ergodicity and long-term distributional properties in stochastic differential inclusions and Markov chains associated with nonsmooth objectives.
- Composite, nonsmooth, and constrained landscapes: Further refinement of subdifferential and stationarity concepts (Goldstein, Clarke, A-stationarity) to unify convergence criteria across unconstrained, constrained, and composite settings.
- Interaction with landscape geometry: Quantitative understanding of escape rates from saddles, plateaus, and local maxima via noise structure, algorithmic variants, and landscape regularity.
The convergence theory surveyed herein provides rigorous probabilistic and geometric foundations for algorithmic performance assessment in modern, high-dimensional non-convex optimization. Key results and algorithmic prescriptions continue to evolve, informed by empirical findings and the relaxation of structural and statistical assumptions in practical applications.