From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability

Published 28 Apr 2026 in math.OC, cs.LG, eess.SY, and math.NA | (2604.25372v2)

Abstract: While it is generally understood that zeroth-order (ZO) algorithms have an extra dependency on their number of iterations for any choice of parameters, compared to their first-order (FO) counterparts, in this work, we show that under several conditions, in expectation, ZO methods do not suffer from extra dimension dependencies in their convergence rates with respect to their FO counterparts. We look at optimisation algorithms from the dynamical systems perspective and analyse the conditions under which one can formulate the average of a ZO algorithm as the average of its FO counterpart with bounded perturbations with values dependent on design parameters. Then, using input-to-state stability properties, we show ZO methods follow the same decay rate as their FO counterparts and converge to a neighbourhood of the fixed point of FO methods, where its radius depends on the bound of the norm of the perturbations, which can be made arbitrarily small. The theoretical findings are illustrated via numerical examples.


Authors (3)

Summary

The paper's main contribution is establishing that ISS permits zeroth-order methods to match FO iteration complexity by controlling perturbations via parameter tuning.
The methodology recasts optimization algorithms as dynamical systems, yielding decay rates that do not depend on problem dimension; the dimension enters only through the radius of the convergence neighbourhood.
Empirical results on quadratic and neural network tasks validate that zeroth-order approaches, with proper smoothing, achieve performance nearly identical to first-order methods.

Closing the Zeroth-Order / First-Order Optimization Gap via Input-to-State Stability

Introduction

The paper "From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability" (2604.25372) challenges the widespread assumption that zeroth-order (ZO) optimization methods incur intrinsic, unavoidable complexity penalties with respect to the ambient problem dimension, relative to first-order (FO) methods. By reinterpreting ZO algorithms through the framework of dynamical systems and leveraging the machinery of input-to-state stability (ISS), the authors demonstrate that, under appropriate conditions, ZO methods can inherit the iteration complexity and decay rates of their FO analogues, with the only remaining difference being a perturbation-induced equilibrium neighbourhood whose radius is controllable and independent of the number of optimization iterations.

Background and Motivation

Zeroth-order optimization, which relies only on function evaluations, is crucial in settings where gradients are unavailable or expensive, ranging from adversarial attacks, black-box reinforcement learning, and simulation-based design to large-scale hyperparameter tuning and memory-efficient adaptation of deep models. Classical analyses (cf. [nesterov_random_2017], [ghadimi2013stochastic], [duchi2015optimal]) demonstrate an $\mathcal{O}(n)$ (dimension-dependent) penalty on iteration complexity for ZO methods, with additional work amortizing this cost under sparsity or effective-dimension structures ([wang2018stochastic], [yue2023zeroth]).

This paper revisits the supposed "dimensional curse," arguing instead that, with correct parameterization and a dynamical systems view, the iteration complexity can be made independent of dimension; only the radius of the convergence neighbourhood has explicit $n$ dependence.

Main Results

Dynamical Systems View and Input-to-State Stability

The central insight is to represent both FO and ZO algorithms as discrete-time dynamical systems. The deterministic update of an FO algorithm is written as $z_{k+1} = w(z_k)$, while the ZO version with randomized gradient surrogates becomes $z_{k+1} = w(z_k) + q_k$, where $q_k$ encodes the ZO estimation error and acts as an additive perturbation. Gaussian-smoothing estimators (as in [nesterov_random_2017]) are unbiased for the gradient of a smoothed surrogate of the objective rather than for the true gradient, so the expected ZO update mirrors the FO system up to the perturbation $q_k$, which captures both the smoothing bias and the finite-sample variance.
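The decomposition into an FO step plus a perturbation can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: the quadratic objective, the forward-difference Gaussian-smoothing estimator, and all names (`gaussian_smoothing_grad`, `mu`, `num_samples`) are assumptions chosen for the example.

```python
import numpy as np

def gaussian_smoothing_grad(f, x, mu, num_samples, rng):
    """Forward-difference Gaussian-smoothing gradient estimate,
    averaged over num_samples random directions (illustrative)."""
    u = rng.standard_normal((num_samples, x.size))
    fx = f(x)
    diffs = np.array([f(x + mu * ui) - fx for ui in u]) / mu
    return (diffs[:, None] * u).mean(axis=0)

def fo_step(x, grad, h):
    """First-order map w(x) for gradient descent."""
    return x - h * grad(x)

def zo_step(f, x, h, mu, num_samples, rng):
    """Zeroth-order counterpart: same update with the gradient replaced by its estimate."""
    return x - h * gaussian_smoothing_grad(f, x, mu, num_samples, rng)

# Hypothetical test problem: f(x) = 0.5 x^T A x with a diagonal A.
A = np.diag([1.0, 2.0, 4.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0, 0.5])
h, mu = 0.1, 1e-3

# q_k is the gap between the ZO step and the FO step taken from the same point.
q_few = zo_step(f, x, h, mu, 10, rng) - fo_step(x, grad, h)
q_many = zo_step(f, x, h, mu, 20000, rng) - fo_step(x, grad, h)
print("||q|| with 10 samples:   ", np.linalg.norm(q_few))
print("||q|| with 20000 samples:", np.linalg.norm(q_many))
```

In this toy setting, averaging over more directions should shrink the perturbation norm, which is the sense in which $q_k$ is controllable through design parameters.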

The authors apply the theory of input-to-state stability (ISS, see [sontag1989smooth], [jiang2001input], [kellett2023introduction]) to establish that if the unperturbed FO system is (locally exponentially or asymptotically) stable and the perturbations are uniformly bounded, then the perturbed system (the averaged ZO dynamics) converges with the same decay rate to a neighbourhood whose radius depends solely on the perturbation bound—not on the iteration count or the dimension directly.
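The mechanism is easiest to see in the simplest case. As an illustrative assumption (stronger than the local stability conditions stated above), suppose $w$ is a contraction with factor $\rho \in (0,1)$ and fixed point $z^\star = w(z^\star)$, and that $\|q_k\| \le \bar q$ for all $k$. The symbols $\rho$, $z^\star$, and $\bar q$ are introduced here for the example and are not taken from the paper. One step of the perturbed system gives

$$
\|z_{k+1} - z^\star\| = \|w(z_k) - w(z^\star) + q_k\| \le \rho\,\|z_k - z^\star\| + \bar q ,
$$

and unrolling the recursion yields

$$
\|z_k - z^\star\| \le \rho^k\,\|z_0 - z^\star\| + \bar q \sum_{i=0}^{k-1} \rho^i \le \rho^k\,\|z_0 - z^\star\| + \frac{\bar q}{1-\rho}.
$$

The first term decays at the rate $\rho$ of the unperturbed system, while the second is a neighbourhood radius proportional to the perturbation bound, which is the ISS-type structure described above.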

ZO as Bounded Perturbation of FO

Detailed analyses are presented for:

Gradient Descent (GD): For $\beta$-strongly convex and $L$-smooth objectives, the expected averaged ZO-GD iteration decomposes into the FO step plus a bounded, parameter-controllable perturbation, with explicit upper bounds on $q_k$ in terms of the step size $h$, the smoothing parameter, the number of samples per iteration, and the oracle variance.
Momentum Methods (Heavy Ball, HB, and Nesterov's Accelerated Gradient, NAG): Both methods are recast in an augmented state-space; the deviation between ZO and FO iterates is bounded via similar arguments. Precise recursions are given for the error and its propagation through momentum, characterizing the effect of the smoothing and oracle noise.
Regularization for Non-Strongly-Convex Objectives: The contraction needed for ISS can be restored by adding a regularization term, even when strong convexity fails. Proximity to the unregularized minimizer is traded off against contraction strength via the regularization parameter.
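The augmented state-space view of momentum can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's construction: the objective is a hypothetical quadratic $f(x) = \tfrac12 x^\top A x$, the heavy-ball parameters `h` and `gamma` are arbitrary stable choices, and the perturbation is a synthetic vector of fixed norm `q_bar` standing in for the ZO error instead of an actual gradient estimator.

```python
import numpy as np

# Hypothetical quadratic f(x) = 0.5 x^T A x with beta = 1, L = 4.
A = np.diag([1.0, 2.0, 4.0])
n = A.shape[0]
h, gamma = 0.3, 0.3            # illustrative step size and momentum

# Heavy ball: x_{k+1} = x_k - h A x_k + gamma (x_k - x_{k-1}).
# Augmented state z_k = (x_k, x_{k-1}) gives the linear system z_{k+1} = M z_k.
I = np.eye(n)
M = np.block([[(1 + gamma) * I - h * A, -gamma * I],
              [I, np.zeros((n, n))]])
rho = np.max(np.abs(np.linalg.eigvals(M)))   # spectral radius of the FO dynamics

# Perturbed system: a bounded disturbance enters the x_{k+1} block only.
rng = np.random.default_rng(0)
q_bar, K = 1e-3, 200
z0 = np.ones(2 * n)
z = z0.copy()
for _ in range(K):
    d = rng.standard_normal(n)
    q = q_bar * d / np.linalg.norm(d)        # synthetic perturbation, ||q|| = q_bar
    z = M @ z
    z[:n] += q

# ISS-style bound: ||z_K|| <= ||M^K|| ||z_0|| + q_bar * sum_{i<K} ||M^i||.
P, gain = np.eye(2 * n), 0.0
for _ in range(K):
    gain += np.linalg.norm(P, 2)
    P = M @ P
bound = np.linalg.norm(P, 2) * np.linalg.norm(z0) + q_bar * gain

print("spectral radius:", rho)
print("final error:", np.linalg.norm(z), "bound:", bound)
```

The transient term is governed by powers of `M`, so the decay rate is that of the unperturbed heavy-ball system, while the residual error scales linearly with `q_bar`.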

The dimension-dependence appears only in the neighbourhood size (specifically, through the smoothing bias), not in the decay rate, which is inherited verbatim from the FO dynamics.

ISS Guarantees: Iteration Complexity Is Dimension-Free

Theoretically, for any GD, HB, or NAG system that is stable under standard conditions (strong convexity and an appropriate choice of algorithm parameters), the iteration complexity of ZO methods whose parameters yield sufficiently small perturbations is identical to that of FO methods. The explicit $\mathcal{O}(n)$ iteration-count penalty in the classical ZO literature is thus shown to be an artefact of prior, cruder analyses. The only irreducible $n$-dependence occurs in the neighbourhood radius, which can be made arbitrarily small with appropriate parameter scaling.

Numerical Results

Two experimental domains reinforce the theoretical claims:

Strongly Convex Quadratic: GD and ZO-GD are compared on a strongly convex quadratic objective. Provided the perturbations stay within the ISS-stabilized regime, the trajectory of ZO-GD shadows that of GD, with the convergence rate intact and no dimension-induced slowdown.
Neural Network Classification (Nonconvex Case with Regularization): A two-layer ReLU network is trained on a binary MNIST task with GD, HB, NAG and their ZO counterparts. ZO optimization tracks FO closely in both loss and classification performance, and this is reflected in the empirical ZO/FO iteration ratios measured across a range of dimensions.
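A small experiment in the spirit of the quadratic comparison can be sketched as below. It does not reproduce the paper's setup: the matrix `A`, the step size, the smoothing parameter `mu`, the sample count, and the forward-difference Gaussian-smoothing estimator are all assumptions made for illustration.

```python
import numpy as np

n = 5
A = np.diag(np.linspace(1.0, 4.0, n))   # hypothetical quadratic, beta = 1, L = 4

def f_batch(X):
    """Evaluate f(x) = 0.5 x^T A x for each row of X."""
    return 0.5 * np.einsum("ij,jk,ik->i", X, A, X)

def zo_grad(x, mu, num_samples, rng):
    """Averaged forward-difference Gaussian-smoothing gradient estimate."""
    U = rng.standard_normal((num_samples, x.size))
    diffs = (f_batch(x + mu * U) - f_batch(x[None, :])) / mu
    return (diffs[:, None] * U).mean(axis=0)

h, mu, num_samples, K = 0.2, 1e-3, 100, 100
rng = np.random.default_rng(0)

x_gd = np.ones(n)
x_zo = np.ones(n)
gd_err, zo_err = [], []
for _ in range(K):
    x_gd = x_gd - h * (A @ x_gd)                            # FO step
    x_zo = x_zo - h * zo_grad(x_zo, mu, num_samples, rng)   # ZO step
    gd_err.append(np.linalg.norm(x_gd))   # minimizer is the origin
    zo_err.append(np.linalg.norm(x_zo))

print(f"GD    final distance to minimizer: {gd_err[-1]:.2e}")
print(f"ZO-GD final distance to minimizer: {zo_err[-1]:.2e}")
```

Plotting `gd_err` and `zo_err` on a log scale would be one way to check whether the ZO curve follows the FO slope before flattening at a noise floor set by `mu` and `num_samples`.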

The per-iteration computational overhead in ZO (from extra function evaluations) is dominated by variance control, that is, oracle averaging over multiple samples per iteration, and is not intrinsic to the optimization dynamics.

Implications and Theoretical Significance

This work effectively reinterprets the "ZO curse of dimensionality": iteration complexity is not inherently dimension-cursed; rather, it is the attainable accuracy (final neighbourhood) that scales with dimension due to the smoothing bias. Asymptotic convergence can be made arbitrarily close to that of FO methods, provided computational resources are allocated to tune the smoothing and oracle-averaging parameters appropriately. Importantly, the same ISS-based argument applies automatically to a wide class of iterative optimizers, removing the need for algorithm- and problem-specific ZO analyses.

Practically, this substantially lowers the perceived gap between ZO and FO when gradients are unavailable, supporting the empirical successes of high-dimensional ZO approaches in large-scale fine-tuning ([malladi2023fine], [zhang2024revisiting]).

Potential Extensions and Future Directions

Key directions include:

Extension to nonconvex, nonsmooth optimization, leveraging Lyapunov/ISS tools for broader equilibria and attractors.
High-probability and finite-time deviation guarantees, tightening the link between expectation-based and pathwise analysis.
Data-driven and adaptive schemes for dynamic tuning of the ZO design parameters, automating the trade-off between accuracy and iteration cost.

The framework is modular and poised for further generalizations, including constrained and stochastic settings.

Conclusion

By recasting ZO optimization within the ISS paradigm, this work (2604.25372) reframes the longstanding narrative of dimension-dependent inefficiency. ZO methods, when correctly parameterized, inherit the convergence rates of their FO counterparts in expectation, with the only gap being a neighbourhood radius tunable by algorithmic parameters. This has significant implications for theory and practice in derivative-free and black-box optimization, reinforcing the credibility and applicability of ZO algorithms in large-scale problems where gradient access is a luxury.
