AI-SARAH: Adaptive Stochastic Gradient Method

Updated 28 January 2026
  • AI-SARAH is an adaptive, implicit, tune-free stochastic recursive gradient optimizer designed for large-scale convex finite-sum minimization.
  • It automatically adjusts step-sizes online by exploiting local smoothness estimates and directional derivatives, enhancing convergence efficiency.
  • The method matches or outperforms tuned SARAH variants while removing the need for manual hyperparameter tuning and preserving robust variance reduction.

AI-SARAH is an adaptive, implicit, and tune-free stochastic recursive gradient optimization method designed for large-scale convex finite-sum minimization problems in machine learning. Developed as a practical advancement over the original SARAH and its variants such as SARAH⁺ and iSARAH, AI-SARAH introduces a mechanism to adapt step-sizes online to local smoothness, efficiently leveraging information from stochastic directional derivatives to accelerate convergence without the requirement of manual hyperparameter tuning or prior knowledge of global smoothness or strong convexity parameters (Shi et al., 2021).

1. Problem Setting and Algorithmic Background

AI-SARAH addresses unconstrained finite-sum minimization problems of the form:

P(w) = \frac{1}{n} \sum_{i=1}^n f_i(w),

where w \in \mathbb{R}^d, f_i(w) is the loss for sample i, and P(w) is the empirical risk. The classical assumption is that each f_i is convex and twice continuously differentiable. The objective may be either \mu-strongly convex (\mu > 0) or merely convex (\mu = 0). Traditional smoothness is defined globally: \|\nabla f_i(x) - \nabla f_i(y)\| \leq L_i \|x - y\|, but AI-SARAH explicitly exploits local smoothness along line segments by considering estimates of the smallest constant L_i(w, v) satisfying

f_i(w - \alpha v) \leq f_i(w) - \alpha \nabla f_i(w)^T v + \frac{L_i(w, v)}{2} \alpha^2 \|v\|^2, \quad \forall \alpha \in [0, 1/L_i(w, v)].
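
As a concrete illustration of the gap this definition captures, the following sketch (synthetic logistic-loss data, not from the paper) compares a segment-wise smoothness estimate with the global bound L_i = 0.25\|a_i\|^2 for the logistic loss:

```python
import numpy as np

# Illustrative check (synthetic data): the local smoothness of a logistic
# loss along a segment is often far below the global bound 0.25*||a||^2.
rng = np.random.default_rng(4)
d = 10
a = rng.standard_normal(d)
b = 1.0

def grad_fi(w):                  # gradient of f_i(w) = log(1 + exp(-b * a.w))
    return -b * a / (1.0 + np.exp(b * (a @ w)))

w = 3.0 * rng.standard_normal(d)   # current iterate (arbitrary)
v = rng.standard_normal(d)         # search direction (arbitrary)
alphas = np.linspace(1e-4, 0.1, 50)
# smallest constant consistent with the gradient change along the segment
L_local = max(np.linalg.norm(grad_fi(w - t * v) - grad_fi(w)) /
              (t * np.linalg.norm(v)) for t in alphas)
L_global = 0.25 * (a @ a)
print(L_local <= L_global)         # True: the local estimate never exceeds it
```

By the global Lipschitz property the segment-wise ratio can never exceed L_global, but in flat regions it is typically orders of magnitude smaller, which is exactly what AI-SARAH exploits.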

The SARAH family introduced the stochastic recursive gradient estimator

v_0 = \nabla P(w_0), \quad v_t = v_{t-1} + \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}),

which AI-SARAH inherits for its efficient, memoryless variance reduction (Nguyen et al., 2017).
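
The recursive estimator above can be sketched in a few lines; the least-squares data, fixed step-size, and loop lengths below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Sketch of the SARAH recursive gradient estimator on a synthetic
# least-squares finite sum P(w) = (1/n) sum_i 0.5*(a_i.w - b_i)^2.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):            # gradient of the i-th component f_i
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):            # nabla P(w)
    return A.T @ (A @ w - b) / n

def sarah_epoch(w, eta=0.05, m=200):
    v = full_grad(w)         # v_0 = nabla P(w_0): one full pass per epoch
    w_prev, w = w, w - eta * v
    for _ in range(m):
        i = rng.integers(n)
        # recursive update: v_t = v_{t-1} + grad_i(w_t) - grad_i(w_{t-1})
        v = v + grad_i(w, i) - grad_i(w_prev, i)
        w_prev, w = w, w - eta * v
    return w

w = np.zeros(d)
for _ in range(10):
    w = sarah_epoch(w)
print(np.linalg.norm(full_grad(w)))  # gradient norm shrinks markedly
```

Note that only the most recent pair of iterates is needed to update v_t, which is the memoryless property referenced above.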

2. Adaptive and Implicit Step-Size Mechanism

Unlike SARAH, which employs a fixed step-size \eta, AI-SARAH adaptively selects step-sizes at each inner iteration. At every inner step, the algorithm defines the subproblem

\xi_t(\alpha) = \|\nabla f_{S_t}(w_{t-1} - \alpha v_{t-1}) - \nabla f_{S_t}(w_{t-1}) + v_{t-1}\|^2,

where S_t is a sampled mini-batch. The step-size \tilde\alpha_{t-1} is chosen to (approximately) minimize \xi_t(\alpha). A one-step Newton update at \alpha = 0 yields

\tilde\alpha_{t-1} = -\frac{\xi_t'(0)}{|\xi_t''(0)|},

where

\begin{align*}
\xi_t'(0) &= -2\, v_{t-1}^T \nabla^2 f_{S_t}(w_{t-1})\, v_{t-1}, \\
\xi_t''(0) &= 2\, v_{t-1}^T \left[\nabla^2 f_{S_t}(w_{t-1})\right]^2 v_{t-1} + 2\, v_{t-1}^T \nabla^3 f_{S_t}(w_{t-1})[v_{t-1}]\, v_{t-1}.
\end{align*}

To stabilize the step-size, an exponential moving average of the reciprocals 1/\tilde\alpha_{t-1} (a running harmonic mean) maintains an upper bound \alpha_{\max}; the step-size for each iteration is then set as \alpha_{t-1} = \min\{\tilde\alpha_{t-1}, \alpha_{\max}\} (Shi et al., 2021).

AI-SARAH thereby adjusts to local curvature and smoothness, estimating effective step-sizes dynamically without access to the global L or \mu. This adaptation is not present in base SARAH or other fixed step-size variance-reduced methods.
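
For a quadratic component function the third-derivative term vanishes, so the one-step Newton update above has a closed form; the following sketch evaluates it on a synthetic Hessian H and direction v (illustrative values, not the paper's data):

```python
import numpy as np

# For a quadratic f with Hessian H, nabla^3 f = 0, so the implicit step-size
# reduces to alpha = -xi'(0)/|xi''(0)| = (v^T H v)/(v^T H^2 v).
rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)        # Hessian of a strongly convex quadratic
v = rng.standard_normal(d)     # current recursive-gradient direction v_{t-1}

xi_p0 = -2.0 * v @ H @ v       # xi'(0)  = -2 v^T H v
xi_pp0 = 2.0 * v @ H @ H @ v   # xi''(0) =  2 v^T H^2 v (third-order term = 0)
alpha = -xi_p0 / abs(xi_pp0)
print(alpha)                   # lies between 1/lambda_max and 1/lambda_min of H
```

The resulting \alpha is a curvature-weighted quantity that automatically sits between the reciprocals of the extreme local eigenvalues, which is why no global L is needed.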

3. Algorithmic Structure and Stopping Criteria

AI-SARAH operates in epochs (outer loops), each with up to m inner steps, but can terminate an epoch early if \|v_t\|^2 < \gamma \|v_0\|^2 for a default threshold \gamma = 1/32. Pseudocode details include:

  • Mini-batch size b (default b = 100 or b \approx 0.1n),
  • Exponential smoothing parameter \beta (default 0.999),
  • No explicit requirement for knowledge of the global L or \mu,
  • Updates using automatic differentiation for efficient computation of \xi_t'(0) and \xi_t''(0).

The algorithm’s stopping rule for the inner loop is identical in spirit to the SARAH⁺ variant, leveraging the observed decay in the squared norm of v_t to avoid unnecessary iterations and maintain variance reduction efficiency (Nguyen et al., 2017).
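
Putting the pieces together, here is a minimal sketch of the epoch structure with the adaptive step-size, the EMA bound, and the early-stopping rule. The least-squares data, the quadratic closed-form step-size, and the initial EMA state delta are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hedged sketch of AI-SARAH-style epochs on a synthetic least-squares problem.
rng = np.random.default_rng(2)
n, d, b = 200, 8, 20
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def batch_grad(w, idx):
    return A[idx].T @ (A[idx] @ w - y[idx]) / len(idx)

def full_grad(w):
    return A.T @ (A @ w - y) / n

def ai_sarah_epoch(w, delta, gamma=1/32, beta=0.999):
    m = 2 * n // b                      # inner-loop cap m ~ 2n/b
    v = full_grad(w)                    # v_0: one full gradient per epoch
    v0_sq = v @ v
    for _ in range(m):
        idx = rng.integers(n, size=b)
        Hb = A[idx].T @ A[idx] / b      # mini-batch Hessian (quadratic loss)
        alpha_t = (v @ Hb @ v) / (v @ Hb @ Hb @ v)   # one-step Newton
        delta = beta * delta + (1 - beta) / alpha_t  # EMA of reciprocals
        alpha = min(alpha_t, 1.0 / delta)            # alpha_max = 1/delta
        w_next = w - alpha * v
        v = v + batch_grad(w_next, idx) - batch_grad(w, idx)
        w = w_next
        if v @ v < gamma * v0_sq:       # SARAH+-style early termination
            break
    return w, delta

w, delta = np.zeros(d), 2.0             # delta = 2.0 caps alpha at 0.5 initially
for _ in range(10):
    w, delta = ai_sarah_epoch(w, delta)
print(np.linalg.norm(full_grad(w)))     # decreases across epochs
```

The EMA state delta persists across epochs, so \alpha_{\max} adapts slowly while \tilde\alpha_t reacts to the local mini-batch curvature at every step.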

4. Theoretical Convergence Properties

Under strong convexity, with P being \mu-strongly convex and each f_i convex (and using L_i^t-local smoothness), AI-SARAH achieves a linearly decaying expected gradient norm per outer epoch:

\mathbb{E}[\|\nabla P(\tilde w_k)\|^2] \leq \left( \prod_{\ell=1}^k \sigma_m^\ell \right) \|\nabla P(\tilde w_0)\|^2,

where

\sigma_m^k = \frac{1}{\mu \mathcal{H}} + \frac{\eta_0 L^0}{2 - \eta_0 L^0},

and the total inner-loop step-sum is \mathcal{H} = \sum_{t=0}^m \eta_t, with \eta_t = 1/L^t. When the local smoothness L^t is much smaller than the global L, \mathcal{H} is larger, leading to potentially faster convergence than standard SARAH. In particular, the rate recovers the classical SARAH rate when L_i^t = L and improves upon it where the local geometry allows (Shi et al., 2021).

5. Computational Complexity

Each AI-SARAH inner iteration consists of:

  • One mini-batch gradient at w_{t-1} and w_t (O(bd)),
  • Two extra directional derivatives via automatic differentiation for \xi_t'(0) and \xi_t''(0) (O(bd) each).

Thus, the per-inner-iteration cost is approximately 3b equivalent stochastic gradients, compared with b for base SARAH. The outer full gradient \nabla P(w_0) is computed once per epoch (O(nd)). For \epsilon-accuracy in the strongly convex case, both AI-SARAH and classical SARAH require O((n + \kappa) \log(1/\epsilon)) gradient-equivalents, since m \approx n/b in practical parameterizations. The constant-factor overhead is offset by the improved adaptation and the absence of hyperparameter tuning (Shi et al., 2021).
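
A back-of-envelope check of this accounting, with illustrative values of n and b (not from the paper's experiments):

```python
# Per-epoch cost in stochastic-gradient equivalents (illustrative numbers).
n, b = 100_000, 100
m = n // b                        # inner steps per epoch, m ~ n/b
sarah_cost = n + b * m            # full gradient + one mini-batch gradient/step
ai_sarah_cost = n + 3 * b * m     # + two autodiff directional derivatives/step
print(sarah_cost, ai_sarah_cost)  # 200000 400000
```

With m \approx n/b, both methods cost a small constant multiple of n per epoch, so the asymptotic complexity is unchanged.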

6. Empirical Performance and Robustness

Extensive experiments on logistic regression tasks, both regularized and unregularized, were conducted on 10 LIBSVM datasets (e.g., ijcnn1, rcv1, news20, gisette, mushrooms). Competitors included fine-tuned SARAH, SARAH+, SVRG, Adam, and SGD-momentum, with up to 5,000 hyperparameter configurations evaluated per dataset for non-AI-SARAH methods.

Key comparative results:

  • AI-SARAH, with default settings (\gamma = 1/32, \beta = 0.999, b = 100), uniformly outperformed or matched the best tuned SARAH, SARAH+, and SVRG algorithms in convergence per effective pass and wall-clock time.
  • AI-SARAH was competitive or faster than Adam and SGD-momentum in reaching lower gradient norms on convex problems.
  • AI-SARAH exhibited robust performance across all problems with no per-dataset tuning.

An example comparison table for the final \|\nabla P(w)\|^2 after 20 passes (regularized setting):

Dataset | AI-SARAH | Best SARAH | Best SVRG | Best Adam | Best SGD-m
ijcnn1  | 1.2e–6   | 8.7e–6     | 1.0e–5    | 3.5e–6    | 5.1e–6
rcv1    | 2.3e–7   | 1.5e–6     | 2.1e–6    | 4.4e–7    | 8.9e–7

Values represent mean outcomes over 10 seeds, indicating that AI-SARAH converges more rapidly without requiring hyperparameter tuning (Shi et al., 2021).

7. Implementation Guidelines and Practical Considerations

Recommended implementation details for AI-SARAH include:

  • Default parameters (\gamma = 1/32, \beta = 0.999, b = 100 or 0.1n) are robust across tasks.
  • The step-size subproblem can be solved via a one-step Newton update at \alpha = 0, evaluable by two backward-mode autodiff passes; for quadratic loss, a closed form is available.
  • Exponential smoothing on 1/\tilde\alpha is used to update \alpha_{\max}; i.e., \alpha_{\max} \leftarrow 1/(\beta \delta_{\text{old}} + (1-\beta)/\tilde\alpha).
  • Stop the inner loop when \|v_t\|^2 < \gamma \|v_0\|^2 or after m \approx 2n/b steps.
  • No requirement to estimate the global L or \mu.
  • Efficient integration into deep learning frameworks (e.g., PyTorch, TensorFlow) by treating the step-size \alpha as a leaf variable in the computational graph and computing the required derivatives through backpropagation (Shi et al., 2021).
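
As a sanity check of the derivative computations, the sketch below evaluates \xi_t'(0) and \xi_t''(0) by central differences (standing in here for the two autodiff passes) on a synthetic quadratic, and compares the resulting step-size with the closed form:

```python
import numpy as np

# Numerically recover xi'(0) and xi''(0) for a quadratic mini-batch loss
# (synthetic H, w, v) and check the one-step Newton step-size against the
# closed form (v^T H v)/(v^T H^2 v) available for quadratics.
rng = np.random.default_rng(3)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)            # mini-batch Hessian of a quadratic loss
w = rng.standard_normal(d)
v = rng.standard_normal(d)

def grad(u):                       # mini-batch gradient of 0.5 * u^T H u
    return H @ u

def xi(a):                         # xi_t(a) = ||grad(w - a v) - grad(w) + v||^2
    g = grad(w - a * v) - grad(w) + v
    return g @ g

h = 1e-5
xi_p = (xi(h) - xi(-h)) / (2 * h)            # central difference for xi'(0)
xi_pp = (xi(h) - 2 * xi(0.0) + xi(-h)) / h**2  # central difference for xi''(0)
alpha = -xi_p / abs(xi_pp)
alpha_exact = (v @ H @ v) / (v @ H @ H @ v)
print(alpha, alpha_exact)          # the two agree to numerical precision
```

In a framework implementation the two finite-difference evaluations would be replaced by the backward-mode passes described above; the finite differences are only a dependency-free stand-in for this sketch.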

8. Relation to SARAH, SARAH⁺, and iSARAH

AI-SARAH shares the stochastic recursive gradient paradigm with SARAH, which achieves linear convergence under strong convexity and offers the unique property that the inner-loop estimator v_t converges linearly in expectation within a single loop (Nguyen et al., 2017). SARAH⁺ introduces adaptive inner-loop termination based on the decay of the v_t norm, a design inherited by AI-SARAH. Inexact SARAH (iSARAH) generalizes the methodology to expectation-minimization problems beyond finite sums by using stochastic, rather than exact, full gradients, and adjusts batch sizes per the theoretical analysis (Nguyen et al., 2018).

The defining advancement in AI-SARAH lies in automating step-size adjustment guided by local geometry—rendering it entirely tune-free in practice—while preserving the memory efficiency and convergence guarantees characteristic of the SARAH lineage.
