Prob-SARAH: Loopless Stochastic Variance Reduction
- Prob-SARAH is a loopless, variance-reduced algorithm for finite-sum optimization that addresses both convex and nonconvex objectives through stochastic recursive gradient estimators.
- Its randomized restart mechanism eliminates the need for nested loops, simplifies implementation, and achieves optimal complexity in expectation and with high probability.
- Empirical evaluations show that Prob-SARAH effectively controls gradient variance, leading to improved performance in logistic regression and neural network training tasks.
Prob-SARAH is a family of stochastic recursive variance-reduced algorithms for finite-sum optimization, designed to address both convex and nonconvex objectives. It generalizes the SARAH method by deploying a randomized "loopless" architecture, often referred to as Loopless SARAH (L2S), and, more recently, by providing probabilistic guarantees on its stochastic recursive gradient estimators. Prob-SARAH achieves optimal complexity in both expectation and high probability, matching or improving upon previous results for finite-sum problems in both theory and empirical performance (Li et al., 2019; Zhong et al., 2024).
1. Formulation and Algorithmic Principles
Given an objective of finite-sum form,

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$

with each $f_i$ assumed to be $L$-smooth (and potentially nonconvex), Prob-SARAH targets finding approximate stationary points $x$ with $\|\nabla f(x)\| \le \epsilon$. Unlike classic SARAH, which uses a double-loop structure, Prob-SARAH/Loopless SARAH replaces the double loop by a single loop where, at each iteration $t$, the algorithm probabilistically "restarts" (computes a full gradient with probability $1/m$ for a positive integer $m$) or applies a SARAH-style recursive update otherwise.
The canonical (loopless) algorithm proceeds:
- At each iteration $t$:
- With probability $1/m$: set $v_t = \nabla f(x_t)$ (full-gradient restart).
- Otherwise: sample $i_t$ uniformly from $\{1, \dots, n\}$ and set $v_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_{t-1}) + v_{t-1}$.
- Update the iterate: $x_{t+1} = x_t - \eta v_t$.
- Output a randomly selected iterate from the trajectory.
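The steps above can be sketched in NumPy. This is a minimal illustration under our own naming and toy problem (the function `prob_sarah` and the least-squares example are not from the papers):

```python
import numpy as np

def prob_sarah(grad_i, full_grad, x0, n, eta, m, T, seed=0):
    """Loopless SARAH sketch: with probability 1/m the estimator is
    'restarted' with a full gradient; otherwise the SARAH recursion is
    applied to a single uniformly sampled component."""
    rng = np.random.default_rng(seed)
    x_prev = np.asarray(x0, dtype=float).copy()
    v = full_grad(x_prev)                 # initialize with a full gradient
    x = x_prev - eta * v
    iterates = [x_prev.copy()]
    for _ in range(T):
        if rng.random() < 1.0 / m:        # Bernoulli restart
            v = full_grad(x)
        else:                             # recursive SARAH update
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(x_prev, i) + v
        x_prev, x = x, x - eta * v        # gradient step with the estimator
        iterates.append(x_prev.copy())
    return iterates                       # in theory, output a random iterate

# Toy finite sum: f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(42)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)
gi = lambda x, i: A[i] * (A[i] @ x - b[i])
fg = lambda x: A.T @ (A @ x - b) / len(b)
xs = prob_sarah(gi, fg, np.zeros(5), n=100, eta=0.02, m=25, T=2000)
```

Note that the final output in the analyses is a uniformly sampled iterate from the trajectory; returning the whole trajectory here makes that choice explicit.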
Prob-SARAH maintains a biased but tightly controlled estimator; the mean-squared error of the gradient estimator decays exponentially between restarts, which is pivotal for convergence analysis (Li et al., 2019).
2. Probabilistic Guarantees and High-Probability Complexity
Prob-SARAH extends classic in-expectation analysis to high-probability statements. The high-probability regime is motivated by the need for single-run performance guarantees, particularly for robust optimization.
A new dimension-free Azuma–Hoeffding inequality for vector-valued martingales with random individual norm bounds enables these high-probability results. For a martingale-difference sequence $\{\xi_t\}_{t=1}^{T}$ with norm bounds $\|\xi_t\| \le B_t$, the following holds with probability at least $1-\delta$ for any $\delta \in (0,1)$:

$$\Big\| \sum_{t=1}^{T} \xi_t \Big\| \lesssim \sqrt{\log(1/\delta) \sum_{t=1}^{T} B_t^2},$$

except possibly on rare large-deviation trajectories (Zhong et al., 2024).
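A concentration bound of this shape can be sanity-checked numerically. The sketch below is an illustration, not from the paper: it uses independent sphere-uniform increments as a simple martingale-difference sequence and the explicit Pinelis-type threshold $\sqrt{2\log(2/\delta)\sum_t B_t^2}$, then checks that the empirical failure rate stays below $\delta$:

```python
import numpy as np

# Monte Carlo sanity check (illustration only): for independent mean-zero
# vector increments with ||xi_t|| = B_t, a Pinelis-type dimension-free bound
# gives P(||sum_t xi_t|| >= sqrt(2 log(2/delta) sum_t B_t^2)) <= delta.
rng = np.random.default_rng(0)
d, T, trials, delta = 50, 200, 2000, 0.05
B = rng.uniform(0.5, 2.0, size=T)            # per-step norm bounds B_t
threshold = np.sqrt(2.0 * np.log(2.0 / delta) * np.sum(B ** 2))

failures = 0
for _ in range(trials):
    dirs = rng.normal(size=(T, d))           # uniform directions on the sphere
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    S = (dirs * B[:, None]).sum(axis=0)      # martingale sum with ||xi_t|| = B_t
    failures += bool(np.linalg.norm(S) >= threshold)

print(failures / trials)                     # empirical rate, should be <= delta
```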
By adapting the recursive estimator and employing parameter schedules tied to statistical confidence, Prob-SARAH achieves $\|\nabla f(\hat{x})\| \le \epsilon$ with probability at least $1-\delta$, with total stochastic-gradient complexity

$$\tilde{O}\big(n + \sqrt{n}\,\epsilon^{-2}\big),$$

where logarithmic factors in $1/\delta$, $1/\epsilon$, and problem parameters are suppressed.
3. Convergence, Step Size Regimes, and Complexity Bounds
Strongly Convex Objective
If $f$ is $\mu$-strongly convex, Prob-SARAH achieves linear convergence up to constants. Selecting the step size $\eta = O(1/L)$ and expected restart length $m = \Theta(\kappa)$, where $\kappa = L/\mu$ is the condition number, yields information-theoretic optimality: $O\big((n + \kappa)\log(1/\epsilon)\big)$ gradient evaluations. If each $f_i$ is individually strongly convex, a larger admissible step size suffices (Li et al., 2019).
Convex Objective
- With an $\epsilon$-independent step size $\eta = O(1/L)$: $O\big((n + 1/\epsilon)\log(1/\epsilon)\big)$ complexity with $m = \Theta(n)$ or $m = \Theta(1/\epsilon)$.
- With an $\epsilon$-dependent step size: $O(n + 1/\epsilon)$ complexity, shaving the logarithmic factor. This regime is preferable when $n$ is large.
Original SARAH required additional non-divergence assumptions for -independent step sizes, which are not necessary for Prob-SARAH (Li et al., 2019).
Nonconvex Objective
Prob-SARAH matches the best known in-expectation bounds: $O\big(n + \sqrt{n}\,\epsilon^{-2}\big)$ gradient computations to find an $\epsilon$-stationary point. In high probability, up to logarithmic factors, $\tilde{O}\big(n + \sqrt{n}\,\epsilon^{-2}\big)$ stochastic gradients are required (Zhong et al., 2024).
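To see why the $\sqrt{n}$ factor matters, a back-of-envelope comparison against the standard $\epsilon^{-4}$ benchmark for plain SGD (constants and logarithmic factors suppressed; the SGD rate is our assumed baseline, not a claim from these papers):

```python
import math

# Back-of-envelope comparison; constants and logarithmic factors suppressed.
def vr_complexity(n, eps):
    # variance-reduced (SARAH-type) nonconvex complexity: n + sqrt(n)/eps^2
    return n + math.sqrt(n) / eps ** 2

def sgd_complexity(eps):
    # plain-SGD benchmark for eps-stationarity: 1/eps^4
    return 1.0 / eps ** 4

n, eps = 10_000, 1e-2
print(vr_complexity(n, eps))  # ~1.0e6 stochastic gradients
print(sgd_complexity(eps))    # ~1.0e8 stochastic gradients
```

At $n = 10^4$ and $\epsilon = 10^{-2}$, variance reduction wins by roughly two orders of magnitude.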
4. Comparison with SARAH and Related Methods
| Feature | SARAH | Prob-SARAH (Loopless/L2S) |
|---|---|---|
| Loops | Double | Single, stochastic restarts |
| Step size (convex) | $\epsilon$-independent allowed under extra assumptions | $\epsilon$-independent allowed |
| Non-divergence assumption | Needed | Not needed |
| Complexity (nonconvex) | $O(n + \sqrt{n}\,\epsilon^{-2})$ in expectation | Same, with high-probability bounds |
| Gradient estimator variance | Lower, fixed restarts | Higher, randomized restarts |
| Generalization (empirical) | Standard | Often superior due to noise injection |
Prob-SARAH achieves algorithmic simplicity—eschewing nested loops in favor of Bernoulli-scheduled restarts. It permits larger step sizes in convex settings without additional assumptions. Empirical evidence shows that the variance-increasing effect of stochastic restarts enables better escape from sharp local minimizers in deep learning tasks, often improving test accuracy relative to SARAH (Li et al., 2019).
5. Empirical Evaluation and Applications
Prob-SARAH demonstrates strong empirical performance on both classic and modern machine learning problems. On logistic regression with nonconvex regularization (LIBSVM datasets: mushrooms/ijcnn1/w7a), Prob-SARAH attains lower upper quantiles of the squared gradient norm compared to SGD, SVRG, and SCSG, indicating superior probabilistic control over stationarity (Zhong et al., 2024). For training a two-layer neural network with GELU activations on MNIST, the method achieves the best probabilistic control of the gradient norm in early epochs and yields competitive validation accuracy, while baselines such as SVRG are prone to poor local minima.
These experiments validate the practical value of the high-probability theoretical guarantees: for users demanding risk control on gradient norms in individual runs, Prob-SARAH not only matches in-expectation results but typically offers improved reliability.
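A small-scale version of this quantile-based evaluation can be sketched as follows. This is our own synthetic illustration, not the paper's experiment: it assumes the commonly used nonconvex regularizer $\lambda \sum_j w_j^2/(1+w_j^2)$, runs the loopless update over many seeds, and reports a high quantile of the final squared gradient norm across runs:

```python
import numpy as np

def logistic_grad(w, X, y, idx, lam=0.01):
    """Gradient of the logistic loss on rows `idx`, plus the (assumed)
    nonconvex regularizer lam * sum_j w_j^2/(1+w_j^2), whose gradient
    is 2*lam*w/(1+w^2)^2."""
    z = X[idx] @ w
    g = X[idx].T @ (1.0 / (1.0 + np.exp(-z)) - y[idx]) / len(idx)
    return g + lam * 2.0 * w / (1.0 + w ** 2) ** 2

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.5 * rng.normal(size=n) > 0).astype(float)
full = np.arange(n)
eta, m, T = 0.05, 20, 600

finals = []
for seed in range(20):                       # independent runs -> quantiles
    r = np.random.default_rng(seed)
    w_prev = np.zeros(d)
    v = logistic_grad(w_prev, X, y, full)    # full gradient at start
    w = w_prev - eta * v
    for _ in range(T):
        if r.random() < 1.0 / m:             # Bernoulli restart
            v = logistic_grad(w, X, y, full)
        else:                                # SARAH recursion on one sample
            i = [r.integers(n)]
            v = logistic_grad(w, X, y, i) - logistic_grad(w_prev, X, y, i) + v
        w_prev, w = w, w - eta * v
    finals.append(np.linalg.norm(logistic_grad(w, X, y, full)) ** 2)

q90 = float(np.quantile(finals, 0.9))        # high quantile across runs
print(q90)
```

Reporting the 0.9-quantile rather than the mean mirrors the single-run risk-control perspective of the high-probability analysis.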
6. Technical Advances and Theoretical Significance
The core technical innovation enabling Prob-SARAH's high-probability guarantees is a new dimension-free Azuma–Hoeffding inequality for martingales with random norm-bounds. This analytic tool facilitates tight, sample-dependent error control in recursive variance-reduced gradient estimators (Zhong et al., 2024). The loopless structure of Prob-SARAH leads to an exponentially decaying memory effect, as reflected in convergence proofs, and eliminates the need for problem-specific outer-loop scheduling.
Prob-SARAH unifies and improves upon techniques from SARAH (Nguyen et al., 2017), SCSG (Lei & Jordan, 2017), and recent loopless variants (Kovalev et al., 2019), demonstrating that stochastic recursion with randomized restarts attains state-of-the-art complexity across convex, strongly convex, and nonconvex optimization with robust probabilistic guarantees (Li et al., 2019; Zhong et al., 2024).