From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability

Published 28 Apr 2026 in math.OC, cs.LG, eess.SY, and math.NA | (2604.25372v2)

Abstract: While it is generally understood that zeroth-order (ZO) algorithms have an extra dependency on their number of iterations for any choice of parameters, compared to their first-order (FO) counterparts, in this work, we show that under several conditions, in expectation, ZO methods do not suffer from extra dimension dependencies in their convergence rates with respect to their FO counterparts. We look at optimisation algorithms from the dynamical systems perspective and analyse the conditions under which one can formulate the average of a ZO algorithm as the average of its FO counterpart with bounded perturbations with values dependent on design parameters. Then, using input-to-state stability properties, we show ZO methods follow the same decay rate as their FO counterparts and converge to a neighbourhood of the fixed point of FO methods, where its radius depends on the bound of the norm of the perturbations, which can be made arbitrarily small. The theoretical findings are illustrated via numerical examples.

Summary

  • The paper's main contribution is establishing that ISS permits zeroth-order methods to match FO iteration complexity by controlling perturbations via parameter tuning.
  • The methodology recasts optimization algorithms as dynamical systems, yielding decay rates that do not depend on the problem dimension; the dimension enters only through the radius of the convergence neighbourhood.
  • Empirical results on quadratic and neural network tasks validate that zeroth-order approaches, with proper smoothing, achieve performance nearly identical to first-order methods.

Closing the Zeroth-Order / First-Order Optimization Gap via Input-to-State Stability

Introduction

The paper "From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability" (2604.25372) challenges the widespread assumption that zeroth-order (ZO) optimization methods incur intrinsic, unavoidable complexity penalties with respect to the ambient problem dimension, relative to first-order (FO) methods. By reinterpreting ZO algorithms through the framework of dynamical systems and leveraging the machinery of input-to-state stability (ISS), the authors demonstrate that, under appropriate conditions, ZO methods can inherit the iteration complexity and decay rates of their FO analogues, with the only remaining difference being a perturbation-induced equilibrium neighbourhood whose radius is controllable and independent of the number of optimization iterations.

Background and Motivation

Zeroth-order optimization, which relies only on function evaluations, is crucial in settings where gradients are unavailable or expensive: adversarial attacks, black-box reinforcement learning, simulation-based design, and large-scale hyperparameter tuning or memory-efficient adaptation of deep models. Classical analyses (cf. [nesterov_random_2017], [ghadimi2013stochastic], [duchi2015optimal]) demonstrate an $\mathcal{O}(n)$ (dimension-dependent) penalty on iteration complexity for ZO methods, with additional work amortizing this cost under sparsity or effective-dimension structures ([wang2018stochastic], [yue2023zeroth]).

This paper revisits the supposed "dimensional curse," arguing instead that, with correct parameterization and a dynamical systems view, the iteration complexity can be made independent of dimension; only the radius of the convergence neighbourhood has explicit $n$ dependence.

Main Results

Dynamical Systems View and Input-to-State Stability

The central insight is to represent both FO and ZO algorithms as discrete-time dynamical systems. The deterministic update of an FO algorithm is written as $z_{k+1} = w(z_k)$, while the ZO version with randomized gradient surrogates becomes $z_{k+1} = w(z_k) + q_k$, where $q_k$ encodes the ZO estimation error and acts as an additive perturbation. Because Gaussian-smoothing estimators are unbiased for the gradient of the smoothed objective (as in [nesterov_random_2017]), the expected ZO update closely mirrors the FO system, with $q_k$ capturing both the smoothing bias and the finite-sample variance.
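The decomposition $z_{k+1} = w(z_k) + q_k$ can be made concrete with a minimal sketch. Everything specific here is an assumption for illustration rather than the paper's setup: the diagonal quadratic `A_DIAG`, the forward-difference Gaussian-smoothing estimator, and the helper names `fo_map`, `zo_map`, and `perturbation` are all hypothetical.

```python
import math
import random

# Hypothetical diagonal quadratic f(x) = 0.5 * sum(a_i * x_i^2); not from the paper.
A_DIAG = [1.0, 2.0, 4.0]


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad_f(x):
    return [a * xi for a, xi in zip(A_DIAG, x)]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def fo_map(z, h):
    """FO system: w(z) = z - h * grad f(z)."""
    return [zi - h * gi for zi, gi in zip(z, grad_f(z))]


def zo_gradient(func, x, mu, num_samples, rng):
    """Forward-difference Gaussian-smoothing estimate, averaged over samples."""
    n = len(x)
    fx = func(x)
    est = [0.0] * n
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (func(shifted) - fx) / mu  # directional finite difference
        for i in range(n):
            est[i] += slope * u[i] / num_samples
    return est


def zo_map(z, h, mu, num_samples, rng):
    """ZO system: the same update with the gradient replaced by its estimate."""
    g_hat = zo_gradient(f, z, mu, num_samples, rng)
    return [zi - h * gi for zi, gi in zip(z, g_hat)]


def perturbation(z, h, mu, num_samples, rng):
    """q = zo_map(z) - w(z), so that zo_map(z) = w(z) + q."""
    zo_next = zo_map(z, h, mu, num_samples, rng)
    fo_next = fo_map(z, h)
    return [a - b for a, b in zip(zo_next, fo_next)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = [1.0, -2.0, 0.5]
    for num_samples in (10, 1000):
        q = perturbation(z, h=0.1, mu=1e-3, num_samples=num_samples, rng=rng)
        print(num_samples, norm(q))
```

In this sketch the size of `q` is governed by the step size, the smoothing parameter, and the number of samples, which is the sense in which the perturbation is controlled by design parameters.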

The authors apply the theory of input-to-state stability (ISS, see [sontag1989smooth], [jiang2001input], [kellett2023introduction]) to establish that if the unperturbed FO system is (locally exponentially or asymptotically) stable and the perturbations are uniformly bounded, then the perturbed system (the averaged ZO dynamics) converges with the same decay rate to a neighbourhood whose radius depends solely on the perturbation bound—not on the iteration count or the dimension directly.
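The mechanism can be seen in a simplified special case, worked out here as an illustration rather than as the paper's general ISS statement. The assumptions are stronger than those in the text: $w$ is taken to be a global contraction with factor $\rho \in (0,1)$ and fixed point $z^\star$, and $\delta$ is a hypothetical uniform bound on $\|q_k\|$.

```latex
% Assumptions (simplified, for illustration): w is a global contraction with
% factor \rho \in (0,1) and fixed point z^\star, and \|q_k\| \le \delta for all k.
\begin{align*}
\|z_{k+1} - z^\star\| &= \|w(z_k) - w(z^\star) + q_k\| \\
  &\le \rho\,\|z_k - z^\star\| + \delta, \\
\|z_k - z^\star\| &\le \rho^{k}\,\|z_0 - z^\star\| + \delta \sum_{j=0}^{k-1} \rho^{j}
  \le \rho^{k}\,\|z_0 - z^\star\| + \frac{\delta}{1-\rho}.
\end{align*}
```

The first term decays at the FO rate $\rho^k$, while the second is a fixed radius proportional to the perturbation bound.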

ZO as Bounded Perturbation of FO

Detailed analyses are presented for:

  • Gradient Descent (GD): For $\beta$-strongly convex and $L$-smooth objectives, the expected averaged ZO-GD iteration decomposes into the FO step plus a bounded, parameter-controllable perturbation, with explicit upper bounds on $q_k$ in terms of the step size $h$, the smoothing parameter, the number of samples per iteration, and the oracle variance.
  • Momentum Methods (Heavy Ball and Nesterov's Accelerated Gradient, NAG): Both methods are recast in an augmented state-space; the deviation between ZO and FO iterates is bounded via similar arguments. Precise recursions are given for the error and its propagation through momentum, characterizing the effect of the smoothing and oracle noise.
  • $\ell_2$ Regularization for Non-Strongly-Convex Objectives: The contraction needed for ISS can be restored by $\ell_2$ regularization, even when strong convexity fails. The regularization parameter trades proximity to the unregularized minimizer against contraction strength.

The dimension dependence appears only in the neighbourhood size (specifically, through the smoothing bias, which scales with $n$), not in the decay rate, which is inherited verbatim from the FO dynamics.
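For the momentum case, the augmented-state idea can be sketched on a scalar quadratic. This is a toy illustration under stated assumptions, not the paper's recursion: the objective $f(x) = \tfrac{1}{2} a x^2$, the parameter values, and the deterministic perturbation `delta * cos(k)` (a stand-in for the ZO error, bounded by `delta`) are all hypothetical.

```python
import math


def heavy_ball_step(state, h, gamma, a, q=0.0):
    """One heavy-ball step on the augmented state (x, x_prev) for f(x) = 0.5*a*x**2.

    q is an additive perturbation standing in for the ZO estimation error.
    """
    x, x_prev = state
    x_next = x - h * a * x + gamma * (x - x_prev) + q
    return (x_next, x)


def run(num_steps, h, gamma, a, delta, x0=1.0):
    """Iterate the (possibly perturbed) system and return the x trajectory."""
    state = (x0, x0)
    traj = [state[0]]
    for k in range(num_steps):
        q = delta * math.cos(k)  # hypothetical bounded perturbation, |q| <= delta
        state = heavy_ball_step(state, h, gamma, a, q)
        traj.append(state[0])
    return traj


def max_deviation(num_steps, h, gamma, a, delta):
    """Largest gap between the perturbed and unperturbed trajectories."""
    fo = run(num_steps, h, gamma, a, 0.0)
    pert = run(num_steps, h, gamma, a, delta)
    return max(abs(p - u) for p, u in zip(pert, fo))


if __name__ == "__main__":
    for delta in (0.1, 0.01, 0.001):
        print(delta, max_deviation(60, h=0.5, gamma=0.25, a=1.0, delta=delta))
```

Because this toy system is linear, the deviation from the unperturbed trajectory scales in proportion to `delta`, while the unperturbed decay rate is untouched.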

ISS Guarantees: Iteration Complexity Is Dimension-Free

Theoretically, for any GD, HB, or NAG system that is stable under standard conditions (strong convexity and appropriately chosen algorithm parameters), the iteration complexity of the ZO method, with parameters yielding sufficiently small perturbations, is identical to that of the FO method. The explicit $\mathcal{O}(n)$ iteration-count penalty in the classical ZO literature is thus shown to be an artefact of prior, cruder analyses. The only irreducible $n$-dependence occurs in the neighbourhood radius, which can be made arbitrarily small with appropriate parameter scaling.
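As a worked illustration of why the iteration count carries no dimension factor, suppose (as an assumption, with hypothetical symbols $\rho$, $r$, $D$, and $\varepsilon$) that the averaged ZO iterates satisfy a bound with linear decay plus a fixed radius:

```latex
% Assumed bound: \|z_k - z^\star\| \le \rho^{k} D + r, with D = \|z_0 - z^\star\|,
% \rho \in (0,1) the FO decay factor, and r the neighbourhood radius.
% If the design parameters are chosen so that r \le \varepsilon / 2, then
\begin{align*}
\rho^{k} D \le \frac{\varepsilon}{2}
\quad\Longleftrightarrow\quad
k \ge \frac{\ln\!\left(2D/\varepsilon\right)}{\ln\!\left(1/\rho\right)}
\end{align*}
% suffices for \|z_k - z^\star\| \le \varepsilon.
```

The count depends on $\rho$, $D$, and $\varepsilon$ only; in this simplified picture the dimension enters through the cost of making $r$ small, not through $k$.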

Numerical Results

Two experimental domains reinforce the theoretical claims:

  • Strongly Convex Quadratic: On a strongly convex quadratic objective, GD and ZO-GD are compared across a range of settings. Provided perturbations stay within the ISS-stabilized regime, the trajectory of ZO-GD shadows that of GD, with the convergence rate intact and no dimension-induced slowdown.
  • Neural Network Classification (Nonconvex Case with $\ell_2$ Regularization): A two-layer ReLU network on a binary MNIST task is trained with GD, HB, NAG and their ZO counterparts. ZO optimization tracks FO closely in both loss and classification performance, with comparable ZO and FO iteration counts across the range of dimensions tested.

The per-iteration computational overhead of ZO (from extra function evaluations) is dominated by variance control through oracle averaging, and is not intrinsic to the optimization dynamics.
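A toy comparison in the spirit of the quadratic experiment can be sketched as follows. It is not a reproduction of the paper's setup: the diagonal objective, the dimensions, the step size, the smoothing parameter `mu`, the tolerance, and the choice of `50 * n` samples per iteration are all assumptions made for illustration.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def make_quadratic(n, beta=1.0, smooth=2.0):
    """Diagonal quadratic with curvatures spread evenly over [beta, smooth]."""
    diag = [beta + (smooth - beta) * i / max(n - 1, 1) for i in range(n)]

    def f(x):
        return 0.5 * sum(a * xi * xi for a, xi in zip(diag, x))

    def grad(x):
        return [a * xi for a, xi in zip(diag, x)]

    return f, grad


def zo_grad(f, x, mu, num_samples, rng):
    """Forward-difference Gaussian-smoothing gradient estimate (sample average)."""
    n = len(x)
    fx = f(x)
    est = [0.0] * n
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        slope = (f([xi + mu * ui for xi, ui in zip(x, u)]) - fx) / mu
        for i in range(n):
            est[i] += slope * u[i] / num_samples
    return est


def iterations_to_tol(step, x0, tol, max_iters=500):
    """Number of steps until ||x|| <= tol (the minimizer is the origin)."""
    x = list(x0)
    for k in range(max_iters):
        if norm(x) <= tol:
            return k
        x = step(x)
    return max_iters


def compare(n, h=0.5, mu=1e-5, tol=1e-3, seed=0):
    """Return (FO iterations, ZO iterations) needed to reach the tolerance."""
    f, grad = make_quadratic(n)
    rng = random.Random(seed)
    num_samples = 50 * n  # assumed sampling budget per iteration
    x0 = [1.0] * n

    def fo_step(x):
        return [xi - h * gi for xi, gi in zip(x, grad(x))]

    def zo_step(x):
        g_hat = zo_grad(f, x, mu, num_samples, rng)
        return [xi - h * gi for xi, gi in zip(x, g_hat)]

    return iterations_to_tol(fo_step, x0, tol), iterations_to_tol(zo_step, x0, tol)


if __name__ == "__main__":
    for n in (2, 4, 8):
        fo_iters, zo_iters = compare(n)
        print(n, fo_iters, zo_iters)
```

The sampling budget grows with `n` in this sketch, so the extra cost shows up in function evaluations per iteration rather than in the number of iterations.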

Implications and Theoretical Significance

This work effectively reinterprets the "ZO curse of dimensionality": iteration complexity is not inherently dimension-cursed; rather, it is the attainable accuracy (final neighbourhood) that scales with dimension due to the smoothing bias. Asymptotic convergence can be made arbitrarily close to that of FO methods, provided computational resources are allocated to decrease the smoothing parameter and increase the number of oracle samples per iteration appropriately. Importantly, the same ISS-based argument applies to a wide class of iterative optimizers, removing the need for algorithm- and problem-specific ZO analyses.

Practically, this substantially lowers the perceived gap between ZO and FO when gradients are unavailable, supporting the empirical successes of high-dimensional ZO approaches in large-scale fine-tuning ([malladi2023fine], [zhang2024revisiting]).

Potential Extensions and Future Directions

Key directions include:

  • Extension to nonconvex, nonsmooth optimization, leveraging Lyapunov/ISS tools for broader equilibria and attractors.
  • High-probability and finite-time deviation guarantees, tightening the link between expectation-based and pathwise analysis.
  • Data-driven and adaptive schemes for dynamically tuning the ZO design parameters (such as the smoothing parameter and the number of samples per iteration), automating the trade-off between accuracy and iteration cost.

The framework is modular and poised for further generalizations, including constrained and stochastic settings.

Conclusion

By recasting ZO optimization within the ISS paradigm, this work (2604.25372) reframes the longstanding narrative of dimension-dependent inefficiency. ZO methods, when correctly parameterized, inherit the convergence rates of their FO counterparts in expectation, with the only gap being a neighbourhood radius tunable by algorithmic parameters. This has significant implications for theory and practice in derivative-free and black-box optimization, reinforcing the credibility and applicability of ZO algorithms in large-scale problems where gradient access is a luxury.

