AI-SARAH: Adaptive Stochastic Gradient Method
- AI-SARAH is an adaptive, implicit, tune-free stochastic recursive gradient optimizer designed for large-scale convex finite-sum minimization.
- It automatically adjusts step-sizes online by exploiting local smoothness estimates and directional derivatives, enhancing convergence efficiency.
- The method matches or outperforms fine-tuned SARAH variants while eliminating manual hyperparameter tuning and preserving robust variance reduction.
AI-SARAH is an adaptive, implicit, and tune-free stochastic recursive gradient optimization method designed for large-scale convex finite-sum minimization problems in machine learning. Developed as a practical advancement over the original SARAH and its variants such as SARAH⁺ and iSARAH, AI-SARAH introduces a mechanism to adapt step-sizes online to local smoothness, efficiently leveraging information from stochastic directional derivatives to accelerate convergence without the requirement of manual hyperparameter tuning or prior knowledge of global smoothness or strong convexity parameters (Shi et al., 2021).
1. Problem Setting and Algorithmic Background
AI-SARAH addresses unconstrained finite-sum minimization problems of the form:
$$\min_{w \in \mathbb{R}^d} \; P(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w),$$
where $f_i(w)$ is the loss for sample $i$ and $P(w)$ is the empirical risk. The classical assumption is that each $f_i$ is convex and twice continuously differentiable. The objective may be either $\mu$-strongly convex ($\mu > 0$) or merely convex ($\mu = 0$). Traditional smoothness is defined globally, $\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|$ for all $w, w'$, but AI-SARAH explicitly exploits local smoothness along line segments by estimating the smallest constant $L_t$ satisfying
$$\|\nabla f_{i_t}(w_{t-1} - \alpha v_{t-1}) - \nabla f_{i_t}(w_{t-1})\| \le L_t \,\alpha\, \|v_{t-1}\|.$$
The SARAH family is built on the stochastic recursive gradient estimator
$$v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1}, \qquad v_0 = \nabla P(w_0),$$
whose efficient variance reduction and memoryless structure motivate AI-SARAH (Nguyen et al., 2017).
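As a concrete illustration, the recursive estimator can be sketched in NumPy on a toy least-squares finite sum; the data, step-size, and batch size below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # toy design matrix: n = 8 samples, d = 3
y = rng.standard_normal(8)
n = A.shape[0]

def grad(w, idx):
    """Averaged gradient of f_i(w) = 0.5 * (a_i^T w - y_i)^2 over samples idx."""
    r = A[idx] @ w - y[idx]
    return A[idx].T @ r / len(idx)

def sarah_epoch(w0, alpha=0.05, m=10, batch=2):
    """One SARAH outer epoch: a full gradient, then recursive inner updates."""
    w_prev = w0
    v = grad(w_prev, np.arange(n))                       # v_0 = full gradient
    w = w_prev - alpha * v
    for _ in range(1, m):
        idx = rng.choice(n, size=batch, replace=False)   # sampled mini-batch
        v = grad(w, idx) - grad(w_prev, idx) + v         # recursive estimator
        w_prev, w = w, w - alpha * v
    return w
```

With `batch=n` at every inner step, the estimator telescopes to the exact gradient and the epoch reduces to full gradient descent, which makes the recursion easy to sanity-check.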
2. Adaptive and Implicit Step-Size Mechanism
Unlike SARAH, which employs a fixed step-size $\alpha$, AI-SARAH adaptively selects a step-size $\alpha_t$ at each inner iteration. At every inner step, the algorithm defines the one-dimensional subproblem
$$\min_{\alpha > 0}\; \xi_t(\alpha), \qquad \xi_t(\alpha) = \|\nabla f_{i_t}(w_{t-1} - \alpha v_{t-1})\|^2,$$
where $i_t$ indexes a sampled mini-batch. The step-size is chosen (approximately) to minimize $\xi_t$. A one-step Newton update at $\alpha = 0$ yields
$$\tilde{\alpha}_t = -\frac{\xi_t'(0)}{\xi_t''(0)},$$
where
$$
\begin{aligned}
\xi_t'(0) &= -2\, v_{t-1}^T \nabla^2 f_{i_t}(w_{t-1})\, v_{t-1}, \\
\xi_t''(0) &= 2\, v_{t-1}^T \left[\nabla^2 f_{i_t}(w_{t-1})\right]^2 v_{t-1} + 2\, v_{t-1}^T \nabla^3 f_{i_t}(w_{t-1})[v_{t-1}]\, v_{t-1}.
\end{aligned}
$$
To stabilize the step-size, an exponential moving average of the reciprocals $1/\tilde{\alpha}_t$ (a smoothed harmonic mean of past step-sizes) maintains an upper bound $\tau_t$. The step-size for each iteration is then set as $\alpha_t = \min\{\tilde{\alpha}_t, \tau_t\}$ (Shi et al., 2021).
AI-SARAH thereby adjusts to local curvature and smoothness, estimating effective step-sizes dynamically without access to the global constants $L$ or $\mu$. This adaptation is absent from base SARAH and other fixed-step-size variance-reduced methods.
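For intuition, the one-step Newton step-size has a closed form in the quadratic case, where the Hessian $H$ is constant and the third-derivative term vanishes, giving $\tilde{\alpha} = (v^T H v)/(v^T H^2 v)$. The sketch below (illustrative data, not the paper's code) uses this form, which exactly minimizes $\|\nabla f(w - \alpha v)\|^2$ when $v$ is the current gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 4))    # toy quadratic: f(w) = 0.5*mean((Aw - y)^2)
y = rng.standard_normal(16)
H = A.T @ A / len(y)                # constant Hessian of f

def newton_step_size(v):
    """alpha = -xi'(0)/xi''(0) with xi'(0) = -2 v^T H v, xi''(0) = 2 ||H v||^2."""
    Hv = H @ v
    return float(v @ Hv) / float(Hv @ Hv)

w = np.zeros(4)
g = A.T @ (A @ w - y) / len(y)      # current gradient, used as the direction v
alpha = newton_step_size(g)         # minimizer of ||grad f(w - a g)||^2
```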
3. Algorithmic Structure and Stopping Criteria
AI-SARAH operates in epochs (outer loops), each with up to $m$ inner steps, but can terminate an epoch early once $\|v_t\|^2 \le \gamma \|v_0\|^2$ for a default threshold $\gamma$. Pseudocode details include:
- Mini-batch size $b$ (default values as reported in Shi et al., 2021),
- Exponential smoothing parameter $\beta$ (default $0.999$),
- No explicit requirement for knowledge of the global constants $L$ or $\mu$,
- Updates using automatic differentiation for efficient computation of $\xi_t'(0)$ and $\xi_t''(0)$.
The algorithm’s stopping rule for the inner loop is identical in spirit to the SARAH⁺ variant, leveraging the observed decay of $\|v_t\|^2$ to avoid unnecessary iterations and maintain variance-reduction efficiency (Nguyen et al., 2017).
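In code, the SARAH⁺-style early stop is a single comparison per inner step, as in the sketch below; the threshold $\gamma$ and the geometric decay of $\|v_t\|^2$ are illustrative stand-ins, not values from the paper:

```python
# Early stopping of the inner loop once ||v_t||^2 <= gamma * ||v_0||^2.
gamma, m = 1.0 / 8, 100                            # illustrative threshold and cap
v0_sq = 4.0                                        # stand-in for ||v_0||^2
v_sq_trace = [v0_sq * 0.7 ** t for t in range(m)]  # stand-in for ||v_t||^2

steps = 0
for v_sq in v_sq_trace:
    steps += 1
    if v_sq <= gamma * v0_sq:   # SARAH+-style termination test
        break
```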
4. Theoretical Convergence Properties
Under strong convexity, with $P$ being $\mu$-strongly convex, each $f_i$ convex, and using $L_t$-local smoothness, AI-SARAH achieves a linearly decaying expected squared gradient norm per outer epoch:
$$\mathbb{E}\left[\|\nabla P(\tilde{w}_s)\|^2\right] \le \sigma^s\, \|\nabla P(\tilde{w}_0)\|^2,$$
where
$$\sigma = \frac{1}{\mu \Sigma} + \max_t \frac{\alpha_t L_t}{2 - \alpha_t L_t}$$
and the total inner-loop step-sum is $\Sigma = \sum_{t=0}^{m} \alpha_t$, with $\alpha_t < 2/L_t$. When the local smoothness $L_t$ is much less than the global constant $L$, $\Sigma$ is larger, leading to potentially faster convergence compared to standard SARAH. In particular, the rate recovers the classical SARAH rate when $\alpha_t \equiv \alpha$ and $L_t \equiv L$, but improves upon it where local geometry allows (Shi et al., 2021).
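As a purely illustrative numeric check (the factor below is a SARAH-style form $\sigma = \frac{1}{\mu\Sigma} + \frac{\alpha L}{2 - \alpha L}$ with $\Sigma = \alpha(m+1)$ for a constant step-size, an assumed reconstruction rather than a quoted theorem), a smaller smoothness constant permits a larger step-size, enlarges $\Sigma$, and shrinks $\sigma$:

```python
mu, m = 1.0, 100   # assumed strong-convexity constant and inner-loop length

def rate_factor(alpha, L):
    """SARAH-style factor: sigma = 1/(mu*Sigma) + alpha*L/(2 - alpha*L)."""
    Sigma = alpha * (m + 1)          # step-sum for a constant step-size
    return 1.0 / (mu * Sigma) + alpha * L / (2.0 - alpha * L)

sigma_global = rate_factor(alpha=0.5 / 10.0, L=10.0)  # alpha = 1/(2L), large L
sigma_local = rate_factor(alpha=0.5 / 2.0, L=2.0)     # alpha = 1/(2L), small L
```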
5. Computational Complexity
Each AI-SARAH inner iteration consists of:
- One mini-batch gradient evaluation at each of $w_t$ and $w_{t-1}$ (cost $2b$),
- Two extra directional derivatives via automatic differentiation for $\xi_t'(0)$ and $\xi_t''(0)$ (together roughly one additional mini-batch gradient, i.e., $b$).
Thus, the per-inner-iteration cost is approximately $3b$ equivalent stochastic gradient evaluations, compared to $2b$ for base SARAH. The outer full gradient is computed once per epoch (cost $n$). For $\epsilon$-accuracy in the strongly convex case, both AI-SARAH and classical SARAH require $\mathcal{O}\big((n + L/\mu)\log(1/\epsilon)\big)$ gradient-equivalents under practical parameterizations. The factor-of-3 cost is offset by the improved adaptation and the absence of hyperparameter tuning (Shi et al., 2021).
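With assumed values of $n$, $b$, and $m$, the per-epoch accounting in gradient-equivalents looks like:

```python
# Per-epoch cost in stochastic-gradient equivalents (values are illustrative):
# one full gradient (n) plus m inner steps costing ~3b for AI-SARAH
# (two mini-batch gradients plus directional derivatives) vs ~2b for SARAH.
n, b, m = 10_000, 64, 200
cost_ai_sarah = n + 3 * b * m
cost_sarah = n + 2 * b * m
overhead = cost_ai_sarah / cost_sarah   # below the worst-case 3/2 per inner step
```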
6. Empirical Performance and Robustness
Extensive experiments on logistic regression tasks, both regularized and unregularized, were conducted on 10 LIBSVM datasets (e.g., ijcnn1, rcv1, news20, gisette, mushrooms). Competitors included fine-tuned SARAH, SARAH⁺, SVRG, Adam, and SGD with momentum, with up to 5,000 hyperparameter configurations evaluated per dataset for the non-AI-SARAH methods.
Key comparative results:
- AI-SARAH, run with its default settings for $b$, $\beta$, and $\gamma$ (no per-dataset tuning), uniformly outperformed or matched the best-tuned SARAH, SARAH⁺, and SVRG runs in convergence per effective pass and in wall-clock time.
- AI-SARAH was competitive or faster than Adam and SGD-momentum in reaching lower gradient norms on convex problems.
- AI-SARAH exhibited robust performance across all problems with no per-dataset tuning.
An example comparison of the final squared gradient norm $\|\nabla P(w)\|^2$ after 20 effective passes (regularized setting):
| Dataset | AI-SARAH | Best SARAH | Best SVRG | Best Adam | Best SGD-m |
|---|---|---|---|---|---|
| ijcnn1 | 1.2e–6 | 8.7e–6 | 1.0e–5 | 3.5e–6 | 5.1e–6 |
| rcv1 | 2.3e–7 | 1.5e–6 | 2.1e–6 | 4.4e–7 | 8.9e–7 |
Values represent mean outcomes over 10 seeds, indicating that AI-SARAH converges more rapidly without requiring hyperparameter tuning (Shi et al., 2021).
7. Implementation Guidelines and Practical Considerations
Recommended implementation details for AI-SARAH include:
- Default parameters are robust across tasks: smoothing $\beta = 0.999$, the default stopping threshold $\gamma$, and a mini-batch size $b$ of, e.g., $0.1n$.
- The step-size subproblem is solved via a one-step Newton update at $\alpha = 0$, evaluable with two backward-mode autodiff passes; for quadratic losses, a closed form is available.
- Exponential smoothing on the reciprocals $1/\tilde{\alpha}_t$ is used to update the upper bound $\tau_t$; i.e., $\frac{1}{\tau_t} = \beta \frac{1}{\tau_{t-1}} + (1 - \beta) \frac{1}{\tilde{\alpha}_t}$.
- The inner loop stops once $\|v_t\|^2 \le \gamma \|v_0\|^2$ or after $m$ steps.
- There is no requirement to estimate the global constants $L$ or $\mu$.
- AI-SARAH integrates efficiently into deep learning frameworks (e.g., PyTorch, TensorFlow) by treating the step-size as a leaf variable in the computational graph and computing the required derivatives through backpropagation (Shi et al., 2021).
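The reciprocal smoothing that maintains the upper bound $\tau_t$ can be sketched as follows; the $\tilde{\alpha}_t$ sequence here is illustrative, while $\beta = 0.999$ is the stated default:

```python
beta = 0.999                      # default exponential smoothing parameter
inv_tau = None                    # running estimate of 1/tau_t

for alpha_newton in [0.5, 0.4, 0.45, 0.6]:   # illustrative Newton step-sizes
    inv = 1.0 / alpha_newton
    # EMA of reciprocals: 1/tau_t = beta * 1/tau_{t-1} + (1 - beta) * 1/alpha_t
    inv_tau = inv if inv_tau is None else beta * inv_tau + (1.0 - beta) * inv
    tau = 1.0 / inv_tau
    step = min(alpha_newton, tau)  # step-size actually used this iteration
```

Because $\beta$ is close to 1, $\tau_t$ moves slowly, so a single large Newton estimate (like the final $0.6$ above) is clipped by the bound rather than taken at face value.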
8. Relation to SARAH, SARAH⁺, and iSARAH
AI-SARAH shares the stochastic recursive gradient paradigm with SARAH, which achieves linear convergence under strong convexity and offers the distinctive property that the expected squared norm of the inner-loop estimator, $\mathbb{E}\|v_t\|^2$, decays linearly within a single loop (Nguyen et al., 2017). SARAH⁺ introduces adaptive inner-loop termination based on the decay of $\|v_t\|^2$, a design inherited by AI-SARAH. The inexact SARAH (iSARAH) generalizes the methodology to expectation-minimization problems beyond finite sums by using stochastic rather than exact full gradients, with batch sizes adjusted per the theoretical analysis (Nguyen et al., 2018).
The defining advancement in AI-SARAH lies in automating step-size adjustment guided by local geometry—rendering it entirely tune-free in practice—while preserving the memory efficiency and convergence guarantees characteristic of the SARAH lineage.