2SPSA: Second-Order Stochastic Approximation

Updated 29 December 2025
  • 2SPSA is a stochastic optimization algorithm that employs random perturbation estimators for both gradients and Hessians in noisy, high-dimensional environments.
  • It implements a Newton-type update with diminishing step and perturbation sizes to achieve stable convergence under simulation-based conditions.
  • Its parallel variant, PSPO, leverages multiple simultaneous evaluations to reduce variance and computational cost, outperforming first-order methods.

Second-Order Simultaneous Perturbation Stochastic Approximation (2SPSA) is a stochastic optimization algorithm designed for efficient Newton-type updates in high-dimensional, noisy, and simulation-based optimization settings where gradient and Hessian information are unavailable or expensive to compute. Building upon the first-order SPSA framework, 2SPSA constructs simultaneous random-direction finite-difference estimators for both gradient and Hessian, and implements a stochastic Newton update using these estimates. In parallel form, known as PSPO, the method is particularly suited to large-scale, cloud-based environments by leveraging parallel computations for variance reduction in finite-difference estimates. The classical 2SPSA, as well as its variants and competitors (e.g., Stein estimator, 2RDSA), have been analyzed for convergence guarantees, query complexities, and computational scaling in both theoretical and applied contexts (Alaeddini et al., 2017, Zhu, 2021, A. et al., 2015, Zhu et al., 2019).

1. Problem Formulation and Core Algorithm

The general objective addressed by 2SPSA is the noisy maximization (or minimization) of a real-valued differentiable objective $L(\theta)$, where $\theta \in \mathbb{R}^p$ is a parameter vector and the observed function values are subject to inherent randomness. For example, in stochastic epidemiological modeling, $\theta$ could encode transmission and recovery rates, and the fitting task is to maximize a pseudo-likelihood given observed data (Alaeddini et al., 2017).

The core 2SPSA update has the form
$$\theta_{k+1} = \theta_k - a_k\, \hat H_k^{-1}\, \hat g_k$$
where $\hat g_k$ and $\hat H_k$ are stochastic simultaneous-perturbation estimates of the gradient and Hessian at iteration $k$, and $a_k$ is a diminishing step-size sequence (Alaeddini et al., 2017, Zhu, 2021). The algorithm draws random perturbation vectors from $\{\pm1\}^p$ for its finite-difference calculations, yielding nearly unbiased estimates from a constant number of function evaluations per iteration, independent of $p$.

2. Construction of Gradient and Hessian Estimators

Gradient Estimation

For each iteration, a two-sided gradient estimator along a random direction $\Delta$ (components i.i.d. $\pm1$) is constructed as
$$\hat g(\theta) = \frac{L(\theta + c\Delta) - L(\theta - c\Delta)}{2c}\, \Delta^{-1}$$
where $c$ is a small positive scalar and $\Delta^{-1}$ denotes the vector of element-wise reciprocals (Alaeddini et al., 2017). Multiple perturbation directions can be sampled in parallel and averaged to reduce variance, especially in PSPO.
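The estimator above fits in a few lines of NumPy. This is a minimal sketch; the function name `sp_gradient` and the quadratic test objective are illustrative, not from the cited papers.

```python
import numpy as np

def sp_gradient(loss, theta, c, rng):
    """Two-sided simultaneous-perturbation gradient estimate.

    A single Rademacher (+/-1) direction is drawn, so only two noisy
    loss evaluations are needed regardless of the dimension p.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher direction
    diff = loss(theta + c * delta) - loss(theta - c * delta)
    return diff / (2.0 * c) * (1.0 / delta)  # element-wise reciprocal of delta

# Usage: noiseless quadratic L(theta) = theta' theta, true gradient 2 * theta
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
g_hat = sp_gradient(lambda t: t @ t, theta, c=1e-3, rng=rng)
```

A single draw of `g_hat` is noisy, but its expectation over the Rademacher direction matches the true gradient, which is what averaging over parallel directions exploits.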

Hessian Estimation

The Hessian estimator in 2SPSA is built by differencing simultaneous-perturbation gradient estimates at randomly perturbed locations:
$$\hat H(\theta) = \frac{1}{2c}\left[\hat g(\theta + c\Delta) - \hat g(\theta - c\Delta)\right]\left(\Delta^{-1}\right)^{\top}$$
where the inner gradient estimates use an independent perturbation direction. Written out with two independent perturbations $\Delta, \tilde{\Delta} \in \{\pm1\}^p$, the classic four-evaluation estimator is (Zhu, 2021):
$$\hat H_k = \frac{1}{4 c_k \tilde{c}_k}\left[F^{+,+}_k - F^{+,-}_k - F^{-,+}_k + F^{-,-}_k\right]\Delta_k^{-1}\big(\tilde{\Delta}_k^{-1}\big)^{\top}$$
where $F^{s,\tilde{s}}_k = L(\theta_k + s\,c_k\Delta_k + \tilde{s}\,\tilde{c}_k\tilde{\Delta}_k)$ denotes a (noisy) function evaluation at a doubly perturbed point, and the outer product of the element-wise reciprocal vectors $\Delta_k^{-1}$ and $\tilde{\Delta}_k^{-1}$ turns the scalar second difference into a full $p \times p$ estimate. Symmetrization, $\hat H_k \leftarrow (\hat H_k + \hat H_k^{\top})/2$, is then applied to ensure a symmetric Hessian estimate.
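The four-evaluation estimator can be sketched directly. The helper name `sp_hessian` and the noise-free quadratic used for the usage example are illustrative assumptions.

```python
import numpy as np

def sp_hessian(loss, theta, c, c_t, rng):
    """Simultaneous-perturbation Hessian estimate from four loss values."""
    d1 = rng.choice([-1.0, 1.0], size=theta.shape)   # Delta
    d2 = rng.choice([-1.0, 1.0], size=theta.shape)   # independent Delta-tilde
    # Scalar second difference over the doubly perturbed points
    dd = (loss(theta + c * d1 + c_t * d2) - loss(theta + c * d1 - c_t * d2)
          - loss(theta - c * d1 + c_t * d2) + loss(theta - c * d1 - c_t * d2))
    h = dd / (4.0 * c * c_t) * np.outer(1.0 / d1, 1.0 / d2)
    return 0.5 * (h + h.T)  # symmetrize

# Usage: quadratic loss with known Hessian A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
loss = lambda t: 0.5 * t @ A @ t
rng = np.random.default_rng(1)
H_hat = sp_hessian(loss, np.zeros(2), c=1e-2, c_t=1e-2, rng=rng)
```

As with the gradient, a single `H_hat` is noisy but unbiased for a quadratic; the smoothing recursion of Section 3 is what makes the estimate usable inside a Newton step.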

3. Update Rules, Convergence, and Stopping Criteria

The Newton-type update in 2SPSA is complemented by gain control and smoothing for stability:

  • Step-size: $a_k = a/(k+1)^\alpha$ with $0.5 < \alpha \leq 1$
  • Perturbation size: $c_k = c/(k+1)^\gamma$ with $0 < \gamma \leq 1/6$
  • Hessian smoothing: $B_{k+1} = (1-w_k)B_k + w_k \hat H_k$ with $w_k = 1/(k+1)$ (Zhu, 2021).
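The gain sequences and the smoothing recursion translate directly into code. The constants $\alpha = 0.602$ and $\gamma = 0.101$ below are common practical choices that respect the stated ranges, not values mandated by the cited references.

```python
import numpy as np

def gains(k, a=0.5, c=0.1, alpha=0.602, gamma=0.101):
    """Decaying step size a_k and perturbation size c_k at iteration k."""
    return a / (k + 1) ** alpha, c / (k + 1) ** gamma

def smooth_hessian(B, H_hat, k):
    """B_{k+1} = (1 - w_k) B_k + w_k H_hat with w_k = 1/(k+1).

    With this weight sequence, B is exactly the running mean of all
    Hessian estimates observed so far, which averages out their noise.
    """
    w = 1.0 / (k + 1)
    return (1.0 - w) * B + w * H_hat
```

The running-mean interpretation is why $B_k$ converges to the true Hessian even though each individual $\hat H_k$ is very noisy.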

Under standard Robbins–Monro-type conditions (e.g., $\sum_k a_k = \infty$, $\sum_k a_k^2/c_k^2 < \infty$), and assuming $L$ is twice continuously differentiable with a unique optimum $\theta^*$, both $\theta_k \to \theta^*$ and $B_k \to \nabla^2 L(\theta^*)$ hold almost surely. The expected mean-squared error decays at rate $O(k^{-2/3})$ (Alaeddini et al., 2017, Zhu, 2021).

Typical stopping criteria include:

  • Stationarity: $\|\hat g_k\| < \mathrm{tol}_1$
  • Update magnitude: $\|\theta_{k+1} - \theta_k\| < \mathrm{tol}_2$
  • Iteration count exceeded: $k \geq k_{\max}$ (Alaeddini et al., 2017).
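The three criteria combine into a single check. The tolerance values below are illustrative; note also that testing stationarity on a single noisy gradient estimate is unreliable in practice, so one may prefer to average several recent estimates before applying the test.

```python
import numpy as np

def should_stop(g_hat, step, k, tol_g=1e-4, tol_step=1e-6, k_max=1000):
    """Stop on gradient stationarity, tiny update, or iteration budget."""
    return (np.linalg.norm(g_hat) < tol_g        # stationarity
            or np.linalg.norm(step) < tol_step   # update magnitude
            or k >= k_max)                       # iteration count
```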

4. Parallelization and Practical Implementation

The parallel stochastic perturbation optimization (PSPO) variant exploits high-performance computing for simultaneous function evaluations. At each iteration, multiple perturbation vectors are evaluated in parallel, and the resulting gradient estimates are combined via least-squares averaging. The estimation error then satisfies $E[\|\hat g_k - \nabla L(\theta_k)\|] \leq \epsilon_k$ provided
$$M_k \geq \frac{\sigma^2 p}{c_k^2 \epsilon_k^2}$$
where $M_k$ is the number of perturbation directions and $\sigma^2$ is the observation-noise variance (Alaeddini et al., 2017).
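A worked instance of this bound, using hypothetical values for $\sigma^2$, $p$, $c_k$, and $\epsilon_k$:

```python
import math

def num_directions(sigma2, p, c_k, eps_k):
    """Smallest integer M_k with M_k >= sigma^2 * p / (c_k^2 * eps_k^2)."""
    return math.ceil(sigma2 * p / (c_k ** 2 * eps_k ** 2))

# Example: noise variance 0.01, p = 50 parameters,
# perturbation size c_k = 0.1, target gradient error eps_k = 1.0
M = num_directions(0.01, 50, 0.1, 1.0)  # -> 50 parallel directions
```

The bound scales linearly in the dimension $p$ and quadratically in the inverse perturbation size and target accuracy, which is why shrinking $c_k$ or $\epsilon_k$ rapidly drives up the parallel budget.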

Empirical results in epidemiological inference show that PSPO converges in fewer iterations (roughly 10–20 vs. 25–30 for first-order SPSA), with classical methods failing due to computational cost and noise sensitivity (Alaeddini et al., 2017).

5. Computational Complexity and Efficient Implementations

Although the per-iteration cost of 2SPSA is $O(1)$ in noisy function evaluations, the linear-algebra operations required by Newton-type updates (inverting $p \times p$ Hessian estimates, preconditioning) scale as $O(p^3)$. Recent developments employ symmetric indefinite factorization (Bunch–Kaufman $LDL^\top$ schemes) to reduce this to $O(p^2)$ FLOPs per iteration (Zhu et al., 2019). The factorization approach enables rank-one updates to the Hessian estimate, block-wise eigenvalue corrections for numerical stability, and efficient computation of the Newton direction, all while preserving the convergence and asymptotic normality guarantees of standard 2SPSA.
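For illustration only, the *effect* of the eigenvalue corrections can be shown with a dense safeguard. Note that the eigendecomposition below itself costs $O(p^3)$; the factorization-based scheme of Zhu et al. (2019) achieves an equivalent safeguard in $O(p^2)$ via rank-one $LDL^\top$ updates, which is not reproduced here.

```python
import numpy as np

def newton_direction(B, g, floor=1e-6):
    """Solve B d = g with an eigenvalue-modulus safeguard.

    Eigenvalues of the (possibly indefinite) smoothed Hessian estimate
    are replaced by max(|lambda|, floor), keeping the linear system
    well conditioned even when B is indefinite or near-singular.
    """
    w, V = np.linalg.eigh(0.5 * (B + B.T))   # symmetric part of B
    w = np.maximum(np.abs(w), floor)         # modulus + floor
    return V @ ((V.T @ g) / w)
```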

6. Comparisons and Variants

First-Order SPSA

First-order SPSA employs the same perturbation-based estimator for the gradient only, using two function evaluations per iteration and updating $\theta_{k+1} = \theta_k + a_k \hat g_k$ (ascent, under the maximization convention). By contrast, standard finite-difference gradient ascent requires $p+1$ evaluations per iteration and is inefficient for large $p$ or in noisy settings (Alaeddini et al., 2017).
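A one-step sketch of first-order SPSA under the maximization convention; the gain values in the usage loop are illustrative defaults.

```python
import numpy as np

def spsa_step(loss, theta, a_k, c_k, rng):
    """One first-order SPSA ascent step: two loss evaluations, any dimension."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    g_hat = (loss(theta + c_k * delta)
             - loss(theta - c_k * delta)) / (2 * c_k) * (1 / delta)
    return theta + a_k * g_hat  # '+' = ascent, for maximization

# Usage: maximize the concave quadratic L(theta) = -||theta||^2
rng = np.random.default_rng(3)
theta = np.array([1.0, -1.0])
for k in range(500):
    theta = spsa_step(lambda t: -(t @ t), theta,
                      a_k=0.1 / (k + 1) ** 0.602,
                      c_k=0.1 / (k + 1) ** 0.101, rng=rng)
```

The cost per iteration is two evaluations regardless of dimension, which is the efficiency gap over $p+1$-evaluation finite differences noted above.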

Stein-Identity and RDSA Approaches

Several alternatives to 2SPSA have been developed to further reduce per-iteration query cost and relax smoothness requirements:

  • Stein-identity–based Hessian estimation uses a single Gaussian perturbation and three function evaluations per iteration, with looser smoothness requirements ($C^3$ with Lipschitz third derivatives) and reduced hyperparameter tuning (Zhu, 2021).
  • Newton-RDSA (second-order random directions stochastic approximation) achieves unbiased gradient/Hessian estimates with only three (vs. four for 2SPSA) loss measurements per iteration and, in its asymmetric Bernoulli variant, demonstrates strictly smaller asymptotic MSE per function call and lower empirical cost (A. et al., 2015).
| Method         | Function Evals/Iter (Hessian) | Perturbation Type       |
|----------------|-------------------------------|-------------------------|
| 2SPSA          | 4                             | Two i.i.d. Rademacher   |
| 2RDSA          | 3                             | One random direction    |
| Stein-identity | 3                             | One Gaussian            |

Both 2RDSA and Stein-based estimators outperform 2SPSA in query complexity and, in the case of asymmetric Bernoulli 2RDSA, in asymptotic efficiency (A. et al., 2015, Zhu, 2021).

7. Algorithmic Summary and Pseudocode

The canonical 2SPSA/PSPO algorithm consists of the following steps at iteration $k$ (Alaeddini et al., 2017):

  1. Compute the step size $a_k$, perturbation size $c_k$, and number of perturbation directions $M_k$ needed to control estimator variance.
  2. In parallel (for PSPO), sample $M_k$ independent $\{\pm1\}^p$ perturbation directions.
  3. Compute two-sided function evaluations to obtain gradient estimates.
  4. Average the gradient estimates (e.g., via least squares if $M_k \geq p$).
  5. Apply analogous perturbation schemes to estimate the Hessian.
  6. Update $\theta_{k+1} = \theta_k - a_k \hat H_k^{-1} \hat g_k$.
  7. Check stopping criteria.
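The steps above can be sketched as a compact, single-direction NumPy implementation (i.e., $M_k = 1$, without the parallel averaging of PSPO). The quadratic test function, gain constants, eigenvalue floor, and step clipping below are illustrative assumptions, not prescriptions from the cited papers; for concreteness the sketch minimizes a noisy loss, matching the minus sign in the update.

```python
import numpy as np

def two_spsa(loss, theta0, n_iter=300, a=0.5, c=0.2, seed=0):
    """Minimal 2SPSA sketch: minimize a noisy loss with Newton-type updates."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    p = theta.size
    B = np.eye(p)                                # smoothed Hessian estimate
    for k in range(n_iter):
        a_k = a / (k + 1) ** 0.602               # step size
        c_k = c / (k + 1) ** 0.101               # perturbation size (c_k = c~_k here)
        d1 = rng.choice([-1.0, 1.0], size=p)     # Delta
        d2 = rng.choice([-1.0, 1.0], size=p)     # independent Delta-tilde
        # Two-sided gradient estimate (2 evaluations)
        g_hat = (loss(theta + c_k * d1)
                 - loss(theta - c_k * d1)) / (2 * c_k) * (1 / d1)
        # Hessian estimate from 4 doubly perturbed evaluations
        dd = (loss(theta + c_k * d1 + c_k * d2) - loss(theta + c_k * d1 - c_k * d2)
              - loss(theta - c_k * d1 + c_k * d2) + loss(theta - c_k * d1 - c_k * d2))
        H_hat = dd / (4 * c_k ** 2) * np.outer(1 / d1, 1 / d2)
        H_hat = 0.5 * (H_hat + H_hat.T)          # symmetrize
        # Hessian smoothing, then an eigenvalue floor so B stays invertible
        w_k = 1.0 / (k + 1)
        B = (1 - w_k) * B + w_k * H_hat
        evals, V = np.linalg.eigh(B)
        B_safe = (V * np.maximum(np.abs(evals), 0.5)) @ V.T
        # Newton-type update with a crude step-norm clip as a safeguard
        step = np.linalg.solve(B_safe, g_hat)
        norm = np.linalg.norm(step)
        if norm > 1.0:
            step /= norm
        theta = theta - a_k * step
    return theta

# Usage: noisy quadratic with minimizer at [1, -1]
A = np.array([[4.0, 1.0], [1.0, 3.0]])
target = np.array([1.0, -1.0])
rng_noise = np.random.default_rng(42)
noisy = lambda t: 0.5 * (t - target) @ A @ (t - target) + 0.01 * rng_noise.normal()
theta_star = two_spsa(noisy, np.zeros(2))
```

Each iteration uses six noisy loss evaluations in total; PSPO would replace the single `d1` by $M_k$ parallel directions and least-squares-average the resulting gradient estimates.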

Concise LaTeX-style pseudocode is provided in (Alaeddini et al., 2017), and the mathematical structure carries over directly to high-performance implementations via indefinite factorization as in (Zhu et al., 2019).


References:

  • (Alaeddini et al., 2017) Application of a Second-order Stochastic Optimization Algorithm for Fitting Stochastic Epidemiological Models
  • (Zhu, 2021) Hessian Estimation via Stein's Identity in Black-Box Problems
  • (A. et al., 2015) Adaptive system optimization using random directions stochastic approximation
  • (Zhu et al., 2019) Efficient Implementation of Second-Order Stochastic Approximation Algorithms in High-Dimensional Problems
