Gradient-Free Optimization
- Gradient-free algorithms are optimization methods that do not require derivatives, instead leveraging function evaluations to estimate search directions.
- They employ techniques such as randomized smoothing and finite-difference estimation to effectively handle non-differentiable, discontinuous, or noisy objectives.
- Applications span machine learning, quantum optimization, and engineering design where gradient calculations are infeasible or unreliable.
Gradient-free algorithms—also termed derivative-free, black-box, or zeroth-order optimization algorithms—are a class of optimization methods that do not require computation of gradients of the objective function. Instead, they operate solely via successive queries to a function-value oracle, estimating search directions or generating new candidate solutions using only function evaluations. These methods are indispensable when gradients are inaccessible, unreliable, computationally expensive to obtain, or the objective is non-differentiable, discontinuous, or noisy.
1. Foundations and Theoretical Guarantees
Gradient-free optimization transforms the standard iterative paradigm of optimization by dispensing with explicit gradient computation, instead constructing update rules based on stochastic or deterministic estimators derived from function value comparisons. The core theoretical foundation is "randomized smoothing": the optimizer queries the objective at points sampled near the iterate, and reconstructs a gradient estimate via finite differences, randomization over spheres or balls, population-based strategies, or non-commutative exploration (Arrasmith et al., 2020, Lin et al., 2022, Akhavan et al., 2023, Yuan et al., 2020).
A defining result is that the uniform smoothing operator

$$f_\delta(x) \;=\; \mathbb{E}_{u \sim \mathrm{Unif}(B_1)}\big[f(x + \delta u)\big]$$

yields a smooth approximation $f_\delta$ of a Lipschitz (possibly nonsmooth, nonconvex) function $f$, whose gradient can be estimated unbiasedly by two-point finite-difference estimators (Lin et al., 2022, Chen et al., 2023). The Goldstein $(\delta,\epsilon)$-stationarity concept further extends optimality to the subdifferential context, guaranteeing that points with small smoothed gradient norm are near-stationary for the original nonsmooth objective.
Minimax lower bounds have established oracle complexity rates for various classes. For $d$-dimensional, strongly convex and $\beta$-smooth objectives, the best-achievable optimization error decays as $N^{-(\beta-1)/\beta}$ in the number $N$ of function evaluations, up to a dimension-dependent factor polynomial in $d$ (Akhavan et al., 2023). For nonconvex, nonsmooth objectives, gradient-free approaches achieve the best known rate for finding $(\delta,\epsilon)$-Goldstein stationary points, with sharp upper and lower bounds (Lin et al., 2022, Chen et al., 2023).
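As a concrete illustration of the two-point finite-difference estimator, here is a minimal NumPy sketch (names and parameters are illustrative, not from the cited works). On a quadratic objective the smoothed gradient coincides with the true gradient, so unbiasedness can be checked directly by averaging:

```python
import numpy as np

def two_point_estimator(f, x, delta, rng):
    """One draw of the spherical two-point gradient estimator:
    g = d/(2*delta) * (f(x + delta*u) - f(x - delta*u)) * u,
    with u uniform on the unit sphere."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform random direction on the sphere
    return d / (2 * delta) * (f(x + delta * u) - f(x - delta * u)) * u

# Smooth test problem where the smoothed gradient equals the true gradient,
# so unbiasedness can be verified numerically: f(x) = 0.5*||x||^2 + b.x
rng = np.random.default_rng(0)
d = 10
b = rng.standard_normal(d)
f = lambda x: 0.5 * x @ x + b @ x
x0 = rng.standard_normal(d)
true_grad = x0 + b

# Average many single-draw estimates; the mean should approach the gradient.
avg = np.mean(
    [two_point_estimator(f, x0, 0.01, rng) for _ in range(20000)], axis=0
)
err = np.linalg.norm(avg - true_grad) / np.linalg.norm(true_grad)
print(f"relative error of averaged estimate: {err:.3f}")
```

The per-draw variance scales with the dimension $d$, which is why the averaged estimate needs many samples; single draws are used inside iterative schemes instead.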
2. Gradient-Free Estimators and Algorithms
The principal gradient-free estimators and associated algorithms include:
- Randomized Directional Smoothing: At each iteration, sample a random direction $u$ (uniformly from the $\ell_2$-sphere [Bach–Perchet 2016], or the $\ell_1$-sphere (Akhavan et al., 2023)) and compute a two-point finite difference to estimate the directional derivative. A typical estimator is
$$\hat g \;=\; \frac{d}{2h}\big(f(x + h u) - f(x - h u)\big)\, u$$
for a random unit direction $u$ and smoothing radius $h > 0$.
- Kernelized and Higher-Order Estimators: Use orthogonal polynomial kernels and higher-order randomization to exploit higher-order ($\beta$-Hölder) smoothness, reducing bias and variance (Akhavan et al., 2023, Beznosikov et al., 2021).
- Population-Based and Evolutionary Algorithms: Maintain a population of candidates (e.g., evolutionary strategies, genetic algorithms). Update through selection, crossover, and mutation based on function value comparisons or fitness scores, independent of gradient information (Alzantot et al., 2018, Liu et al., 12 Oct 2025).
- Projection-Free Zeroth-Order Frank–Wolfe: Combine zeroth-order gradient estimation with Frank–Wolfe steps, dispensing with explicit projections and using only a linear minimization oracle (Sahu et al., 2018).
- Recursive Variance-Reduced Zeroth-Order Methods: Adapt SPIDER/SARAH-type recursive estimators to smoothly approximate nonconvex objectives (Chen et al., 2023).
- Non-Commutative Map Methods: Apply cyclic parametric perturbations and Lie-bracket-based function compositions to recover gradient-like directions via noncommutative interactions (Feiling et al., 2020).
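Several of the estimators above plug into the same basic loop. A hedged sketch of zeroth-order gradient descent with the randomized directional estimator, on a toy quadratic (step size and smoothing radius chosen ad hoc for this example):

```python
import numpy as np

def zo_gradient_descent(f, x0, steps=2000, lr=0.05, delta=1e-3, seed=0):
    """Minimal zeroth-order gradient descent: at each step, estimate the
    gradient from two function values along a random unit direction and
    take a descent step. Only f-evaluations are used, never derivatives."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    d = x.size
    for _ in range(steps):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g = d / (2 * delta) * (f(x + delta * u) - f(x - delta * u)) * u
        x -= lr * g
    return x

# Black-box objective: a shifted quadratic with minimizer at x* = (1, ..., 1).
f = lambda x: np.sum((x - 1.0) ** 2)
x = zo_gradient_descent(f, np.zeros(5))
print(f"f(x) after optimization: {f(x):.6f}")
```

Each iteration costs exactly two oracle calls regardless of $d$; the price is that only one random direction of the gradient is observed per step, which is where the dimension factors in the complexity rates below originate.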
3. Complexity and Rate Results
Key complexity rates, as analytically and empirically established, are summarized as follows (see references for the precise algorithms achieving these rates):
| Objective Class | Strong Convexity | Smoothness | Complexity | Notable References |
|---|---|---|---|---|
| Nonsmooth Convex | Yes | Lipschitz | $O(d/\epsilon)$ | (Beznosikov et al., 2021, Lin et al., 2022) |
| Nonsmooth Nonconvex | No | Lipschitz | $O(d^{3/2}\delta^{-1}\epsilon^{-4})$ | (Lin et al., 2022) |
| Nonsmooth Nonconvex VR | No | Lipschitz | $O(d^{3/2}\delta^{-1}\epsilon^{-3})$ | (Chen et al., 2023) |
| Smooth Convex | Yes | $\beta$-Hölder | $O\big(d^2\epsilon^{-\beta/(\beta-1)}\big)$ | (Akhavan et al., 2023) |
| Strongly Convex | Yes | $\beta$-Hölder | Minimax optimal | (Akhavan et al., 2023) |
| PL Condition | No | Smooth | Polylog improvement over convex | (Akhavan et al., 2023) |
| Zeroth-Order FW (Convex) | Yes | Smooth | $O(d/\epsilon^3)$ | (Sahu et al., 2018) |
Here, $d$ is the ambient dimension, $\epsilon$ is the desired stationarity or suboptimality tolerance, and $N$ is the total number of zeroth-order oracle calls.
Bias-variance tradeoffs, step-size schedules, and smoothing-radius choices are critical; for $\ell_1$-based randomization, sharper dimension constants can be achieved in the noiseless setting (Akhavan et al., 2023).
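The dimension factors in the rates above can be seen empirically: the second moment of the spherical two-point estimator grows linearly with $d$, even when the gradient norm is held fixed. A small illustrative check (the setup is my own, not from the cited papers):

```python
import numpy as np

def mean_sq_norm(d, n_samples=4000, delta=1e-4, seed=1):
    """Mean squared norm of the two-point spherical estimator for
    f(x) = 0.5*||x||^2 at a point with unit gradient norm. Theory predicts
    E||g||^2 ~ d * ||grad f||^2, i.e. linear growth in the dimension."""
    rng = np.random.default_rng(seed)
    f = lambda x: 0.5 * x @ x
    x = np.zeros(d)
    x[0] = 1.0  # gradient = e_1, unit norm for every d
    total = 0.0
    for _ in range(n_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g = d / (2 * delta) * (f(x + delta * u) - f(x - delta * u)) * u
        total += g @ g
    return total / n_samples

low, high = mean_sq_norm(5), mean_sq_norm(50)
print(f"E||g||^2 at d=5: {low:.1f}, at d=50: {high:.1f}")
```

The tenfold increase in dimension produces roughly a tenfold increase in estimator second moment, which is one source of the $d$-dependence in the oracle complexities.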
4. Advanced Variants and Distributed Settings
Gradient-free methods have been extended to a range of advanced settings:
- Saddle-Point and Minimax Problems: Randomized mirror descent and one-point estimators yield dimension-dependent oracle complexities for non-smooth convex-concave structures, with improved rates in smooth regimes and kernelized schemes offering further gains under higher-order smoothness (Beznosikov et al., 2021).
- Distributed and Online Optimization: Compressed communication, consensus protocols, and error-feedback mechanisms enable scalable zeroth-order distributed optimization with provable regret and communication efficiency (Zhu et al., 5 Dec 2025).
- Stochastic or Markovian Noise: By leveraging randomized batching and multilevel Monte Carlo, modern algorithms remove the dependence on the Markov chain mixing time, achieving optimal rates even with dependent noise (Prokhorov et al., 3 Jan 2026).
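As an illustration of the saddle-point setting, here is a minimal zeroth-order gradient descent-ascent sketch on a strongly convex-concave toy problem. This is not the mirror-descent scheme of the cited work; step sizes and the objective are ad hoc choices for demonstration:

```python
import numpy as np

def zo_gda(f, x0, y0, steps=3000, lr=0.05, delta=1e-4, seed=0):
    """Zeroth-order gradient descent-ascent for min_x max_y f(x, y):
    each player estimates its own partial gradient from two function
    values along a random unit direction, then x descends and y ascends."""
    rng = np.random.default_rng(seed)
    x, y = x0.astype(float).copy(), y0.astype(float).copy()
    dx, dy = x.size, y.size
    for _ in range(steps):
        u = rng.standard_normal(dx)
        u /= np.linalg.norm(u)
        v = rng.standard_normal(dy)
        v /= np.linalg.norm(v)
        gx = dx / (2 * delta) * (f(x + delta * u, y) - f(x - delta * u, y)) * u
        gy = dy / (2 * delta) * (f(x, y + delta * v) - f(x, y - delta * v)) * v
        x -= lr * gx  # descent step for the minimizing player
        y += lr * gy  # ascent step for the maximizing player
    return x, y

# Strongly convex-concave toy saddle with unique equilibrium at the origin.
f = lambda x, y: 0.5 * x @ x - 0.5 * y @ y + x @ y
x, y = zo_gda(f, np.ones(3), np.ones(3))
dist = np.hypot(np.linalg.norm(x), np.linalg.norm(y))
print(f"distance to saddle point: {dist:.6f}")
```

Because the underlying operator is strongly monotone, the iterates contract toward the saddle point in expectation despite the randomness of the directional estimates.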
5. Applications: Machine Learning, Quantum Algorithms, and Engineering
Gradient-free optimization is increasingly critical in settings where gradients are expensive or unavailable:
- Deep Neural Networks and LLMs: Evolutionary strategies enable full-parameter training of LLMs and large transformers without backpropagation, supporting non-differentiable or black-box architectures (Liu et al., 12 Oct 2025, Kus et al., 2024). Pretrained meta-models (e.g., TabPFN) facilitate gradient-free reinforcement learning with performance rivaling DQN in low-dimensional control (Schiff et al., 14 Sep 2025).
- Robustness and Adversarial Attacks: Black-box adversarial attacks on DNNs and Bayesian neural networks (BNNs) use genetic algorithms and zeroth-order finite-difference methods to bypass obfuscated gradients and exploit predictive uncertainty, achieving query efficiency superior to coordinate-wise estimation (Alzantot et al., 2018, Yuan et al., 2020).
- Variational Quantum Circuits and VQAs: Rotosolve, Fraxis, and FQS, as gate-wise analytic or eigen-solver-based approaches, are immune to direct gradient vanishing but still suffer exponential scaling in cost differences under “barren plateaus” (Arrasmith et al., 2020, Pankkonen et al., 10 Jul 2025). Gate freezing strategies further mitigate measurement overhead in large parameterized quantum circuits.
- Engineering and Scientific Design: Gradient-free neural topology optimization leverages generative latent-variable models (e.g., LBAE) and CMA-ES to efficiently search high-dimensional, possibly discontinuous design spaces—e.g., compliance, fracture resistance, or robustness of structures (Kus et al., 2024).
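A toy version of the population-based search used in the attack and design applications above — truncation selection, uniform crossover, Gaussian mutation — applied to a non-differentiable objective. All parameter choices here are illustrative, not taken from the cited systems:

```python
import numpy as np

def genetic_minimize(f, dim, pop_size=40, generations=200, sigma=0.3, seed=0):
    """Toy genetic algorithm for black-box minimization: truncation
    selection, uniform crossover, and Gaussian mutation, driven purely
    by function-value (fitness) comparisons -- no gradients anywhere."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-3, 3, size=(pop_size, dim))
    for _ in range(generations):
        fitness = np.apply_along_axis(f, 1, pop)
        elite = pop[np.argsort(fitness)[: pop_size // 4]]  # keep the best quarter
        children = []
        for _ in range(pop_size - len(elite)):
            pa, pb = elite[rng.integers(len(elite), size=2)]
            mask = rng.random(dim) < 0.5                   # uniform crossover
            child = np.where(mask, pa, pb) + sigma * rng.standard_normal(dim)
            children.append(child)
        pop = np.vstack([elite, children])                 # elites survive unmutated
    best = pop[np.argmin(np.apply_along_axis(f, 1, pop))]
    return best

# Non-differentiable black-box objective: sum of absolute deviations from 2.
f = lambda x: np.sum(np.abs(x - 2.0))
best = genetic_minimize(f, dim=4)
print(f"best objective found: {f(best):.3f}")
```

Because elites are carried over unchanged, the best-so-far objective is monotone non-increasing; the fixed mutation scale `sigma` limits final precision, which is one reason such methods are sample-inefficient near optima.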
6. Limitations and Open Challenges
Despite their generality, gradient-free methods face significant challenges:
- Curse of Dimensionality: Complexity rates frequently scale at least linearly or quadratically in the dimension $d$, and improvements via model-based proposals, randomization over the $\ell_1$-sphere, or latent reparameterization only partially mitigate this effect (Akhavan et al., 2023, Kus et al., 2024).
- Barren Plateaus and Vanishing Cost Differences: In variational quantum settings, both gradient-based and gradient-free optimizers can fail due to exponentially vanishing cost differences, demanding infeasibly high sampling precision (Arrasmith et al., 2020, Pankkonen et al., 10 Jul 2025).
- High Variance and Sample Inefficiency: Evolutionary strategies and genetic algorithms are susceptible to high estimator variance, slow convergence, and sample inefficiency, especially in nonconvex or rugged landscapes (Liu et al., 12 Oct 2025, Alzantot et al., 2018).
- Hyperparameter Sensitivity and Lack of Generalization: Many state-of-the-art variants require careful tuning of smoothing radii, mutation rates, population sizes, or kernel weights, and theoretical rates often reflect upper bounds with implicit large constants.
- Distributed and Online Tradeoffs: In distributed architectures, communication compression reduces convergence speed unless error correction mechanisms are carefully managed; variance grows with ambient dimension and consensus gap (Zhu et al., 5 Dec 2025).
7. Prospects and Directions for Future Research
Open directions include:
- Adaptive kernelization and smoothing algorithms that leverage local smoothness or structure estimation (Akhavan et al., 2023, Beznosikov et al., 2021).
- Variance reduction beyond SPIDER/SARAH for gradient-free oracles (Chen et al., 2023).
- Hybrid schemes combining estimated derivatives with surrogate gradient learning or population-based search (Liu et al., 12 Oct 2025).
- Unified barren landscape theory explaining extrapolation behavior for both first- and zeroth-order methods (Arrasmith et al., 2020, Pankkonen et al., 10 Jul 2025).
- Noise-aware and quantum-device-specific zeroth-order strategies (Pankkonen et al., 10 Jul 2025).
- More efficient high-dimensional latent search via learned generative priors or active subspace adaptation (Kus et al., 2024).
- Theoretical models for Markovian noise interaction and optimal tradeoffs between sampling, direction selection, and parallelization (Prokhorov et al., 3 Jan 2026).
In summary, gradient-free algorithms are a mathematically mature and rapidly evolving area with deep theoretical guarantees, wide-ranging applicability, and substantial challenges in high-dimensional, nonconvex, and noisy settings. Their continued development is critical for optimization in black-box, nondifferentiable, or resource-constrained environments.