
Parallel SPSA (PSPO) Optimization

Updated 29 December 2025
  • PSPO is a stochastic optimization algorithm that uses multiple simultaneous perturbations to approximate gradients for noisy, expensive black-box functions.
  • It leverages parallel computing to reduce estimator variance, resulting in faster convergence compared to traditional SPSA.
  • Its applications include complex simulations like epidemiological model calibration, where it efficiently balances computational cost and accuracy.

Parallel Simultaneous Perturbation Optimization (PSPO) is a stochastic gradient-based optimization algorithm tailored for maximizing expected values of expensive, noisy black-box functions. PSPO generalizes the classical Simultaneous Perturbation Stochastic Approximation (SPSA) method by leveraging multiple simultaneous perturbations in each iteration and exploiting parallel computing architectures. Its design addresses the high-variance gradient estimates that arise when optimizing complex stochastic systems—such as those found in epidemiology or simulation-driven sciences—where function evaluations are computationally intensive and noise is significant (Alaeddini et al., 2017).

1. Formulation of the Stochastic Optimization Problem

PSPO is applied to problems of the form

$$\max_{\theta \in \mathbb{R}^p} L(\theta) = \mathbb{E}[f(\theta)]$$

where $f(\theta)$ is a continuously differentiable mean function and observed outputs are

$$y(\theta) = f(\theta) + \epsilon(\theta), \qquad \epsilon(\theta) \sim \mathcal{N}(0, \sigma^2).$$

The function $f$ is accessed only through noisy and costly queries, for example, calls to a Monte Carlo simulator. The optimization is fully derivative-free, using only noisy function-value queries to approximate gradients and (optionally) Hessians.
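As a concrete illustration of this setup, the sketch below uses a hypothetical quadratic mean function as a stand-in for an expensive simulator, with an assumed noise level `SIGMA`; the optimizer only ever sees the noisy oracle `y`:

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 3.0  # assumed observation-noise standard deviation, for illustration

def f(theta):
    """Hypothetical mean function: a smooth quadratic stand-in for an
    expensive simulator; its maximum (value 0) is at theta = 1."""
    return -np.sum((theta - 1.0) ** 2)

def y(theta):
    """Noisy query: the optimizer only observes f plus Gaussian noise."""
    return f(theta) + rng.normal(0.0, SIGMA)
```

In a real application, `f` would be replaced by a call to the simulator, and each evaluation of `y` could take minutes or hours.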

2. Review of Simultaneous Perturbation Stochastic Approximation (SPSA)

SPSA constructs a first-order estimate of the gradient at each iteration by evaluating two points, perturbed from the current parameter $\theta_k$ by a random vector $\Delta_k \in \{\pm 1\}^p$:

$$y_k^{(+)} = y(\theta_k + c_k \Delta_k), \qquad y_k^{(-)} = y(\theta_k - c_k \Delta_k),$$

with $c_k > 0$ a perturbation parameter. The stochastic gradient estimate is

$$\hat{g}_k(\theta_k) = \frac{y_k^{(+)} - y_k^{(-)}}{2 c_k} \, \Delta_k^{-1},$$

where $\Delta_k^{-1}$ denotes the element-wise inverse. Parameter updates use a Robbins–Monro style diminishing step sequence $a_k$:

$$\theta_{k+1} = \theta_k + a_k \hat{g}_k(\theta_k).$$

The SPSA estimator is unbiased to first order, but it can suffer from high variance when function noise is substantial, so many iterations may be required for convergence (Alaeddini et al., 2017).
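A single SPSA ascent iteration can be sketched as follows (a minimal illustration, not the paper's reference implementation; note that the element-wise inverse of a $\pm 1$ vector is the vector itself):

```python
import numpy as np

def spsa_step(theta, y, a_k, c_k, rng):
    """One SPSA ascent step: two noisy evaluations along a random
    Rademacher direction yield the simultaneous-perturbation gradient."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)   # Delta_k in {+1,-1}^p
    y_plus = y(theta + c_k * delta)
    y_minus = y(theta - c_k * delta)
    # element-wise inverse of a +/-1 vector equals the vector itself
    g_hat = (y_plus - y_minus) / (2.0 * c_k) * delta
    return theta + a_k * g_hat                         # ascent: maximizing L
```

Each step costs exactly two function evaluations regardless of the dimension $p$, which is SPSA's defining advantage over coordinate-wise finite differences.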

3. PSPO Algorithmic Structure

PSPO trades SPSA's minimal per-iteration cost for reduced estimator variance via $M$-way parallelism. In each iteration, $M$ independent random Bernoulli perturbations $\{\Delta_{k,1}, \ldots, \Delta_{k,M}\}$ are sampled; for each direction, two function values are computed in parallel:

$$y^{(+)}_i = y(\theta_k + c_k \Delta_{k,i}), \qquad y^{(-)}_i = y(\theta_k - c_k \Delta_{k,i}).$$

Collect the perturbations as the columns of a matrix $\Delta \in \{\pm 1\}^{p \times M}$ and the finite-difference ratios $\frac{y_i^{(+)} - y_i^{(-)}}{2c_k}$ into a vector $\delta f$. For $M \geq p$ and $\Delta$ of full rank, the minimum-variance unbiased gradient estimator is

$$\hat{G}_k = (\Delta\Delta^T)^{-1} \Delta\, \delta f.$$

For $M < p$, a minimum-norm solution is used instead:

$$\hat{G}_k = \Delta (\Delta^T\Delta)^{-1} \delta f.$$

This estimator is unbiased, $\mathbb{E}[\hat{G}_k] = \nabla f(\theta_k)$, and its variance decreases rapidly with $M$ and with the conditioning of $\Delta$.
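The two estimator branches can be sketched directly from the least-squares and minimum-norm formulas. In this illustration the $2M$ evaluations run in a serial loop; since they are mutually independent, a real implementation would dispatch each $(+/-)$ pair to a separate parallel worker:

```python
import numpy as np

def pspo_gradient(theta, y, c_k, M, rng):
    """PSPO gradient estimate from M simultaneous Rademacher perturbations."""
    p = theta.size
    Delta = rng.choice([-1.0, 1.0], size=(p, M))       # columns are Delta_{k,i}
    df = np.array([(y(theta + c_k * Delta[:, i]) - y(theta - c_k * Delta[:, i]))
                   / (2.0 * c_k) for i in range(M)])   # finite-difference ratios
    if M >= p:
        # minimum-variance least-squares estimator (requires full-rank Delta)
        return np.linalg.solve(Delta @ Delta.T, Delta @ df)
    # minimum-norm solution when there are fewer perturbations than parameters
    return Delta @ np.linalg.solve(Delta.T @ Delta, df)
```

On a noise-free quadratic the symmetric differences are exact, so with $M \geq p$ the estimator recovers the gradient exactly; with noise, averaging over $M$ directions is what drives the variance reduction.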

PSPO incorporates a nonlinear conjugate-gradient (NCG) update as its outer loop. The residual $r_k = -\hat{G}_k$ and conjugate search direction $d_k$ are updated by the Polak–Ribière rule:

$$d_k = r_k + \beta_k d_{k-1}, \qquad \beta_k = \frac{\hat{G}_k^T (\hat{G}_k - \hat{G}_{k-1})}{\hat{G}_{k-1}^T \hat{G}_{k-1}}.$$

A reduced-Hessian estimate in direction $d_k$ is constructed for the line search:

$$H_{d_k}(\theta_k) \approx \frac{\hat{G}(\theta_k + \epsilon d_k) - \hat{G}(\theta_k - \epsilon d_k)}{2\epsilon \|d_k\|^2}\, d_k^T + \text{transpose}.$$

The step size is

$$\alpha_k = -\frac{\hat{G}_k^T d_k}{d_k^T H_{d_k} d_k},$$

yielding the parameter update

$$\theta_{k+1} = \theta_k + \alpha_k d_k$$

(Alaeddini et al., 2017).
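One outer iteration can be sketched as below. This is a simplification of the scheme above: instead of forming the reduced-Hessian matrix explicitly, it computes the scalar curvature $d_k^T H d_k$ directly via a central difference of the gradient estimate along $d_k$, and it adds a non-negative-$\beta$ restart safeguard that is not part of the source description:

```python
import numpy as np

def ncg_step(theta, grad_est, G_prev, d_prev, eps=1e-2):
    """One outer NCG iteration given a gradient-estimate oracle grad_est(theta):
    Polak-Ribiere beta, curvature-based step size along the search direction."""
    G = grad_est(theta)
    r = -G
    if d_prev is None:
        d = r                                    # first iteration: steepest direction
    else:
        beta = G @ (G - G_prev) / (G_prev @ G_prev)
        d = r + max(beta, 0.0) * d_prev          # restart safeguard (added heuristic)
    # scalar curvature d^T H d via a central difference of the gradient along d
    dHd = d @ (grad_est(theta + eps * d) - grad_est(theta - eps * d)) / (2.0 * eps)
    alpha = -(G @ d) / dHd
    return theta + alpha * d, G, d
```

On a concave quadratic with an exact gradient oracle, this step is equivalent to an exact line search and reaches the maximizer in a single iteration; with a noisy PSPO gradient estimate standing in for `grad_est`, several iterations are needed.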

4. Error Tolerance, Step Size, and Theoretical Guarantees

PSPO allows explicit control of the stochastic gradient error per iteration. To enforce $\mathbb{E}[\|\hat{G}_k - \nabla f(\theta_k)\|] \leq \epsilon_k$, the required number of parallel perturbations is

$$M \geq \max\left\{p, \; \frac{\sigma^2 p}{c_k^2 \epsilon_k^2}\right\}.$$

Step-size sequences diminish as $a_k = a/(k+A)^\alpha$ and $c_k = c/(k+1)^\gamma$ with $\alpha \in (0.5, 1]$ and $\gamma \in (0.101, 0.5]$; the offset $A$ stabilizes the initial steps. Under these conditions, PSPO converges almost surely to a local maximum of $L(\theta)$ (Alaeddini et al., 2017).
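The bound on $M$ translates into a one-line helper (the function name is illustrative):

```python
import math

def min_perturbations(p, sigma, c_k, eps_k):
    """Smallest M satisfying M >= max{p, sigma^2 * p / (c_k^2 * eps_k^2)},
    the per-iteration error-tolerance bound for the PSPO estimator."""
    return max(p, math.ceil(sigma ** 2 * p / (c_k ** 2 * eps_k ** 2)))
```

Note the two regimes: when the noise $\sigma$ is small relative to the tolerance, the dimension $p$ dominates and $M = p$ suffices; when noise dominates, $M$ grows as $\sigma^2 / \epsilon_k^2$.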

The variance of the PSPO gradient estimator is

$$\operatorname{Var}(\hat{G}_k) = \frac{2 \sigma^2}{c_k^2} (\Delta\Delta^T)^{-1}.$$

With the recommended construction of $\Delta$, $\operatorname{Tr}[(\Delta\Delta^T)^{-1}] \leq p/4$ for $M \geq 4$, so the total variance scales as $\sigma^2 p / (2 c_k^2)$ (Alaeddini et al., 2017).

5. Computational Complexity and Parallel Scalability

Each PSPO iteration incurs $O(M)$ function evaluations, but when $M$ parallel compute resources are available, the wall-clock time per iteration matches that of SPSA's two serial function calls. The reduced variance of the PSPO estimator means fewer outer iterations are required, providing a speedup proportional to $M$ up to the regime where gradient noise no longer limits convergence. Communication and data-aggregation overheads are negligible for moderate $p$ (e.g., $p < 20$) and $M < 10^3$. As $M$ grows and forming $(\Delta\Delta^T)^{-1}$ becomes significant, additional efficiency analysis is necessary (Alaeddini et al., 2017).

6. Applications and Benchmarks

PSPO's performance has been demonstrated on both synthetic and applied benchmarks:

  • Quadratic Toy Problem: For $f(x) = \|x - 1\|^2 + w$ with $w \sim \mathcal{N}(0, 9)$ and $p = 5$, PSPO reduced the number of required optimization iterations by approximately 50% compared to SPSA, with near-linear wall-clock speedup for $M$ up to 8 (Alaeddini et al., 2017).
  • Stochastic Epidemiological Model Calibration: PSPO was used to fit a two-parameter SIR model $(\beta, \gamma)$ to data from the 1861 Hagelloch measles outbreak (188 cases). PSPO with $M = 8$ converged in about 10 iterations versus SPSA's 24, a $\sim 2.5\times$ reduction in iteration count with near-linear wall-clock speedup on 8 parallel cores. The final parameter estimates and uncertainty quantification (via a Monte Carlo-averaged Hessian) matched those of SPSA, with $\beta^* \approx 0.45$ day$^{-1}$, $\gamma^* \approx 0.15$ day$^{-1}$, and 95% confidence intervals $[0.43, 0.47]$ and $[0.14, 0.16]$, respectively (Alaeddini et al., 2017).
| Algorithm | Per-iteration evals | Minimum wall-time (with $M$ workers) | Convergence (iterations) |
|---|---|---|---|
| SPSA | 2 | $2\,T_\text{sim}$ | 24–26 |
| PSPO ($M = 4$) | 5 | $\approx 1\,T_\text{sim}$ | 10–12 |

7. Practical Considerations and Extensions

PSPO's effectiveness depends on balancing $M$, $c_k$, and the observed noise $\sigma^2$. The Chebyshev-based lower bound for $M$ ensures that the gradient estimator achieves a prescribed error tolerance without excessive cost. Step-size and perturbation schedule parameters ($a$, $c$, $\alpha$, $\gamma$) are best tuned via grid search on a coarse surrogate model.

PSPO lends itself to second-order variants by estimating the Hessian matrix efficiently in parallel, and by projecting the estimate onto the negative-definite cone to guarantee robust Newton-like steps. It can deliver natural Fisher-matrix-based uncertainty quantification with little additional computational overhead (Alaeddini et al., 2017).
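The projection onto the negative-definite cone can be realized by symmetrizing the Hessian estimate and clipping its eigenvalues below a small negative floor; a minimal sketch (the floor constant is an assumed tuning value, and the source does not specify this particular construction):

```python
import numpy as np

def project_negative_definite(H, floor=-1e-6):
    """Return the nearest (in the eigenvalue sense) negative-definite
    matrix to H: symmetrize, then clip eigenvalues to at most `floor`.
    Guarantees d^T H d < 0 for all d, so Newton-like ascent steps are safe."""
    H = 0.5 * (H + H.T)                 # enforce symmetry
    w, V = np.linalg.eigh(H)            # eigendecomposition of symmetric part
    w = np.minimum(w, floor)            # clip spurious non-negative curvature
    return V @ np.diag(w) @ V.T
```

Eigenvalues that are already sufficiently negative pass through unchanged, so well-estimated curvature directions are preserved while noisy, wrong-sign directions are regularized.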

The approach scales nearly linearly in $M$ for $M < 1000$ and modest $p$, further supported by advances in high-performance cloud-computing infrastructure.
