- The paper's main contribution is to show, via input-to-state stability (ISS), that zeroth-order (ZO) methods can match first-order (FO) iteration complexity when the estimation perturbation is controlled through parameter tuning.
- The methodology recasts optimization algorithms as dynamical systems, yielding decay rates that are independent of the problem dimension; the dimension enters only through the radius of the convergence neighbourhood.
- Empirical results on quadratic and neural network tasks validate that zeroth-order approaches, with proper smoothing, achieve performance nearly identical to first-order methods.
Introduction
The paper "From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability" (2604.25372) challenges the widespread assumption that zeroth-order (ZO) optimization methods incur intrinsic, unavoidable complexity penalties with respect to the ambient problem dimension, relative to first-order (FO) methods. By reinterpreting ZO algorithms as dynamical systems and applying the machinery of input-to-state stability (ISS), the authors show that, under appropriate conditions, ZO methods inherit the iteration complexity and decay rates of their FO analogues. The only remaining difference is a perturbation-induced equilibrium neighbourhood whose radius is controllable and independent of the number of optimization iterations.
Background and Motivation
Zeroth-order optimization, which relies only on function evaluations, is crucial in settings where gradients are unavailable or expensive, from adversarial attacks, black-box reinforcement learning, and simulation-based design to large-scale hyperparameter tuning and memory-efficient adaptation of deep models. Classical analyses (cf. [nesterov_random_2017], [ghadimi2013stochastic], [duchi2015optimal]) establish an O(n) dimension-dependent penalty on iteration complexity for ZO methods, and later work mitigates this cost under sparsity or effective-dimension structure ([wang2018stochastic], [yue2023zeroth]).
This paper revisits the supposed "dimensional curse," arguing instead that, with correct parameterization and a dynamical systems view, the iteration complexity can be made independent of dimension; only the radius of the convergence neighbourhood has explicit n dependence.
Main Results
The central insight is to represent both FO and ZO algorithms as discrete-time dynamical systems. The deterministic update of an FO algorithm is written as $z_{k+1} = w(z_k)$, while the ZO version with randomized gradient surrogates becomes $z_{k+1} = w(z_k) + q_k$, where $q_k$ encodes the ZO estimation error and acts as an additive perturbation. Gaussian-smoothing estimators (as in [nesterov_random_2017]) are unbiased for the gradient of the smoothed objective rather than of the objective itself, so the expected ZO update closely mirrors the FO system, with $q_k$ capturing both the smoothing bias and the finite-sample variance.
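To make the decomposition concrete, the following minimal Python sketch builds the FO map $w$ for gradient descent on a toy quadratic and recovers the perturbation $q_k$ as the difference between one ZO step and one FO step. The objective, the step size `h`, the smoothing parameter `mu`, and the sample count `num_samples` are illustrative assumptions rather than values from the paper; the estimator is the standard Gaussian-smoothing forward difference.

```python
import math
import random

def f(x):
    # Illustrative objective (an assumption): f(x) = 0.5 * ||x||^2.
    return 0.5 * sum(xi * xi for xi in x)

def grad_f(x):
    # Exact gradient of the illustrative objective.
    return list(x)

def zo_gradient(x, mu, num_samples, rng):
    """Gaussian-smoothing estimate: mean of ((f(x + mu*u) - f(x)) / mu) * u."""
    n = len(x)
    fx = f(x)
    g = [0.0] * n
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        slope = (f([xi + mu * ui for xi, ui in zip(x, u)]) - fx) / mu
        for i in range(n):
            g[i] += slope * u[i] / num_samples
    return g

def fo_map(z, h):
    # w(z): one exact gradient-descent step.
    return [zi - h * gi for zi, gi in zip(z, grad_f(z))]

def zo_step(z, h, mu, num_samples, rng):
    # Same update, with the gradient replaced by its ZO estimate.
    g = zo_gradient(z, mu, num_samples, rng)
    return [zi - h * gi for zi, gi in zip(z, g)]

rng = random.Random(0)
z = [1.0, -2.0, 0.5]
h = 0.1
z_fo = fo_map(z, h)
z_zo = zo_step(z, h, mu=1e-3, num_samples=200, rng=rng)

# q_k is whatever the ZO step adds on top of the FO map: z_zo = w(z) + q.
q = [a - b for a, b in zip(z_zo, z_fo)]
q_norm = math.sqrt(sum(c * c for c in q))
```

In this sketch `q` equals `-h` times the gradient-estimation error, so shrinking `mu` or raising `num_samples` shrinks `q_norm` without touching the map `fo_map` itself.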
The authors apply the theory of input-to-state stability (ISS, see [sontag1989smooth], [jiang2001input], [kellett2023introduction]) to establish that if the unperturbed FO system is (locally exponentially or asymptotically) stable and the perturbations are uniformly bounded, then the perturbed system (the averaged ZO dynamics) converges with the same decay rate to a neighbourhood whose radius depends solely on the perturbation bound—not on the iteration count or the dimension directly.
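The mechanism can be illustrated in the simplest special case. The derivation below is a sketch under assumptions that are not taken from the paper: the FO map $w$ is a global contraction with factor $\rho \in (0,1)$ and fixed point $z^\star$, and the perturbation satisfies $\|q_k\| \le \bar q$ for all $k$. The symbols $\rho$, $z^\star$, and $\bar q$ are introduced here for illustration only.

```latex
\begin{align*}
\|z_{k+1} - z^\star\|
  &= \|w(z_k) - w(z^\star) + q_k\|
   \le \rho\,\|z_k - z^\star\| + \bar q, \\
\|z_k - z^\star\|
  &\le \rho^k\,\|z_0 - z^\star\| + \bar q \sum_{j=0}^{k-1} \rho^j
   \le \rho^k\,\|z_0 - z^\star\| + \frac{\bar q}{1-\rho}.
\end{align*}
```

The first term decays at the FO rate $\rho^k$, while the second term is a fixed radius that depends only on the perturbation bound, which is the structure the ISS argument generalizes.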
ZO as Bounded Perturbation of FO
Detailed analyses are presented for:
- Gradient Descent (GD): For β-strongly convex and L-smooth objectives, the expected averaged ZO-GD iteration can be decomposed as the sum of the FO step and a bounded, parameter-controllable perturbation, with explicit upper bounds on $q_k$ in terms of the step size $h$, the smoothing parameter, the number of samples per iteration, and the oracle variance.
- Momentum Methods (Heavy Ball and Nesterov's Accelerated Gradient, NAG): Both methods are recast in an augmented state-space; the deviation between ZO and FO iterates is bounded via similar arguments. Precise recursions are given for the error and its propagation through momentum, characterizing the effect of the smoothing and oracle noise.
- Regularization for Non-Strongly-Convex Objectives: The contraction needed for ISS can be restored by adding a regularization term, even when strong convexity fails. The regularization parameter trades off proximity to the unregularized minimizer against contraction strength.
The dimension-dependence appears only in the neighbourhood size (specifically, through the smoothing bias), not in the decay rate, which is inherited verbatim from the FO dynamics.
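As a rough illustration of the augmented state-space view for momentum methods, the sketch below writes heavy ball as a map on the pair (x_k, x_{k-1}) and injects a bounded gradient error in place of a real ZO estimator. The diagonal quadratic, the step size `H`, the momentum `GAMMA`, and the error bound `EPS` are hypothetical choices, and the uniform noise is only a stand-in for the ZO estimation error.

```python
import math
import random

A = [1.0, 4.0]          # hypothetical quadratic: f(x) = 0.5 * sum(A_i * x_i^2)
H, GAMMA = 0.2, 0.3     # illustrative step size and momentum coefficient
EPS = 0.01              # assumed uniform bound on each gradient-error coordinate

def grad(x):
    return [a * xi for a, xi in zip(A, x)]

def hb_map(z, grad_error=None):
    """One heavy-ball step on the augmented state z = (x_k, x_{k-1})."""
    x, x_prev = z
    g = grad(x)
    if grad_error is not None:
        # The perturbation enters only through the gradient, i.e. the first block.
        g = [gi + ei for gi, ei in zip(g, grad_error)]
    x_next = [xi - H * gi + GAMMA * (xi - pi) for xi, gi, pi in zip(x, g, x_prev)]
    return (x_next, x)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

rng = random.Random(1)
z_fo = ([1.0, 1.0], [1.0, 1.0])
z_zo = z_fo
for _ in range(200):
    err = [rng.uniform(-EPS, EPS) for _ in A]
    z_fo = hb_map(z_fo)                    # unperturbed FO dynamics
    z_zo = hb_map(z_zo, grad_error=err)    # same map plus bounded perturbation

fo_dist = norm(z_fo[0])   # distance of the FO iterate from the minimizer at 0
zo_dist = norm(z_zo[0])   # the perturbed iterate settles in a small neighbourhood
```

Under these assumptions the perturbed iterate does not converge exactly but stays within a radius proportional to `EPS`, while the unperturbed iterate decays to the minimizer.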
ISS Guarantees: Iteration Complexity Is Dimension-Free
Theoretically, for any GD, HB, or NAG system that is stable under standard conditions (strong convexity, appropriate choice of step size and momentum parameters), the iteration complexity of the ZO method is identical to that of the FO method, provided the ZO parameters yield sufficiently small perturbations. The explicit O(n) iteration-count penalty in the classical ZO literature is thus shown to be an artefact of prior, cruder analyses. The only irreducible n-dependence occurs in the neighbourhood radius, which can be made arbitrarily small with appropriate parameter scaling.
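A small calculation shows why a bounded perturbation leaves the iteration count essentially unchanged. The sketch assumes an error bound of the form rho**k * d0 + radius, where `rho` is the FO contraction factor, `d0` the initial distance, and `radius` the perturbation neighbourhood; these names and the numbers used are illustrative, not quantities from the paper.

```python
import math

def iterations_to_reach(eps, rho, d0, radius=0.0):
    """Smallest k with rho**k * d0 + radius <= eps, assuming that bound holds."""
    if eps <= radius:
        raise ValueError("target accuracy lies inside the perturbation neighbourhood")
    if d0 <= eps - radius:
        return 0
    return math.ceil(math.log(d0 / (eps - radius)) / math.log(1.0 / rho))

# FO case: no perturbation, so the neighbourhood radius is zero.
k_fo = iterations_to_reach(1e-3, rho=0.9, d0=1.0)

# ZO case: same contraction factor, small nonzero radius.
k_zo = iterations_to_reach(1e-3, rho=0.9, d0=1.0, radius=1e-4)
```

The dimension does not appear in the count at all; in this model it can only enter through `radius`, and the count approaches the FO value as `radius` shrinks relative to the target accuracy.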
Numerical Results
Two experimental domains reinforce the theoretical claims:
- Strongly Convex Quadratic: GD and ZO-GD are compared on a strongly convex quadratic objective across a range of problem dimensions. Provided perturbations stay within the ISS-stabilized regime, the trajectory of ZO-GD shadows GD, with convergence rate intact and no dimension-induced slowdowns.
- Neural Network Classification (Nonconvex Case with Regularization): A two-layer ReLU network on a binary MNIST task is trained with GD, HB, NAG and their ZO counterparts. ZO optimization tracks FO closely in both loss and classification performance, and the empirical ZO/FO iteration ratio stays roughly constant across the range of problem dimensions tested.
The per-iteration computational overhead in ZO (from extra function evaluations) is dominated by variance control, that is, by oracle averaging over multiple samples per iteration, and is not intrinsic to the optimization dynamics.
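A toy version of the quadratic comparison can be sketched as follows. It is not the paper's experiment: the objective f(x) = 0.5 * ||x||^2, the step size, the tolerance, the smoothing parameter `mu`, and the rule of 10 * n samples per iteration are all assumptions chosen to keep the perturbation small as the dimension grows.

```python
import math
import random

def f(x):
    return 0.5 * sum(xi * xi for xi in x)

def zo_grad(x, mu, num_samples, rng):
    # Gaussian-smoothing forward-difference estimate, averaged over samples.
    n = len(x)
    fx = f(x)
    g = [0.0] * n
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        slope = (f([xi + mu * ui for xi, ui in zip(x, u)]) - fx) / mu
        for i in range(n):
            g[i] += slope * u[i] / num_samples
    return g

def iterations_to_tol(n, use_zo, h=0.5, tol=1e-3, mu=1e-6, rng=None, max_iter=500):
    """Count steps until ||x|| <= tol, starting from a unit-norm point."""
    x = [1.0 / math.sqrt(n)] * n
    for k in range(max_iter + 1):
        if math.sqrt(sum(xi * xi for xi in x)) <= tol:
            return k
        # Assumed sampling rule: 10 * n oracle calls per ZO iteration.
        g = zo_grad(x, mu, 10 * n, rng) if use_zo else list(x)
        x = [xi - h * gi for xi, gi in zip(x, g)]
    return max_iter

rng = random.Random(0)
results = {
    n: (iterations_to_tol(n, use_zo=False), iterations_to_tol(n, use_zo=True, rng=rng))
    for n in (5, 10, 20)
}
for n, (k_fo, k_zo) in results.items():
    print(f"n={n:3d}  GD iterations={k_fo}  ZO-GD iterations={k_zo}")
```

The extra cost of the ZO runs sits in the inner sampling loop, whose length grows with n under the assumed rule, rather than in the number of outer iterations.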
Implications and Theoretical Significance
This work effectively reinterprets the "ZO curse of dimensionality": iteration complexity is not inherently dimension-cursed; rather, it is the attainable accuracy (final neighbourhood) that scales with dimension due to the smoothing bias. Asymptotic convergence can be made arbitrarily close to that of FO methods, provided computational resources are allocated to decrease the smoothing parameter and increase the number of samples per iteration appropriately. Importantly, the same ISS-based argument applies automatically to a wide class of iterative optimizers, removing the need for algorithm- and problem-specific ZO analyses.
Practically, this substantially lowers the perceived gap between ZO and FO when gradients are unavailable, supporting the empirical successes of high-dimensional ZO approaches in large-scale fine-tuning ([malladi2023fine], [zhang2024revisiting]).
Potential Extensions and Future Directions
Key directions include:
- Extension to nonconvex, nonsmooth optimization, leveraging Lyapunov/ISS tools for broader equilibria and attractors.
- High-probability and finite-time deviation guarantees, tightening the link between expectation-based and pathwise analysis.
- Data-driven and adaptive schemes for dynamic tuning of the smoothing parameter and the number of samples per iteration, automating the trade-off between accuracy and iteration cost.
The framework is modular and poised for further generalizations, including constrained and stochastic settings.
Conclusion
By recasting ZO optimization within the ISS paradigm, this work (2604.25372) reframes the longstanding narrative of dimension-dependent inefficiency. ZO methods, when correctly parameterized, inherit the convergence rates of their FO counterparts in expectation, with the only gap being a neighbourhood radius tunable by algorithmic parameters. This has significant implications for theory and practice in derivative-free and black-box optimization, reinforcing the credibility and applicability of ZO algorithms in large-scale problems where gradient access is a luxury.