AI-SARAH: Adaptive Stochastic Gradient Method

Updated 28 January 2026
  • AI-SARAH is an adaptive, implicit, tune-free stochastic recursive gradient optimizer designed for large-scale convex finite-sum minimization.
  • It automatically adjusts step-sizes online by exploiting local smoothness estimates and directional derivatives, enhancing convergence efficiency.
  • The method matches or outperforms tuned SARAH variants while removing the need for manual hyperparameter tuning and preserving robust variance reduction.

AI-SARAH is an adaptive, implicit, and tune-free stochastic recursive gradient optimization method designed for large-scale convex finite-sum minimization problems in machine learning. Developed as a practical advancement over the original SARAH and its variants such as SARAH⁺ and iSARAH, AI-SARAH introduces a mechanism to adapt step-sizes online to local smoothness, efficiently leveraging information from stochastic directional derivatives to accelerate convergence without the requirement of manual hyperparameter tuning or prior knowledge of global smoothness or strong convexity parameters (Shi et al., 2021).

1. Problem Setting and Algorithmic Background

AI-SARAH addresses unconstrained finite-sum minimization problems of the form:

P(w) = \frac{1}{n} \sum_{i=1}^n f_i(w),

where w \in \mathbb{R}^d, f_i(w) is the loss for sample i, and P(w) is the empirical risk. The classical assumption is that each f_i is convex and twice continuously differentiable. The objective may be either \mu-strongly convex (\mu > 0) or merely convex (\mu = 0). Traditional smoothness is defined globally: \|\nabla f_i(x) - \nabla f_i(y)\| \leq L_i \|x - y\|, but AI-SARAH explicitly exploits local smoothness along line segments by considering estimates of the smallest constant L_i(w, v) satisfying

f_i(w - \alpha v) \leq f_i(w) - \alpha \nabla f_i(w)^T v + \frac{L_i(w, v)}{2} \alpha^2 \|v\|^2, \quad \forall \alpha \in [0, 1/L_i(w, v)].
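
As a concrete illustration of the gap this definition captures, the following sketch (synthetic logistic-loss data, not from the paper) compares a segment-wise smoothness estimate with the global bound L_i = 0.25\|a_i\|^2 for the logistic loss:

```python
import numpy as np

# Illustrative check (synthetic data): the local smoothness of a logistic
# loss along a segment is often far below the global bound 0.25*||a||^2.
rng = np.random.default_rng(4)
d = 10
a = rng.standard_normal(d)
b = 1.0

def grad_fi(w):                  # gradient of f_i(w) = log(1 + exp(-b * a.w))
    return -b * a / (1.0 + np.exp(b * (a @ w)))

w = 3.0 * rng.standard_normal(d)   # current iterate (arbitrary)
v = rng.standard_normal(d)         # search direction (arbitrary)
alphas = np.linspace(1e-4, 0.1, 50)
# smallest constant consistent with the gradient change along the segment
L_local = max(np.linalg.norm(grad_fi(w - t * v) - grad_fi(w)) /
              (t * np.linalg.norm(v)) for t in alphas)
L_global = 0.25 * (a @ a)
print(L_local <= L_global)         # True: the local estimate never exceeds it
```

By the global Lipschitz property the segment-wise ratio can never exceed L_global, but in flat regions it is typically orders of magnitude smaller, which is exactly what AI-SARAH exploits.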

The SARAH family introduced the stochastic recursive gradient estimator

v_0 = \nabla P(w_0), \quad v_t = v_{t-1} + \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}),

which AI-SARAH inherits for its efficient, memoryless variance reduction (Nguyen et al., 2017).
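
The recursive estimator above can be sketched in a few lines; the least-squares data, fixed step-size, and loop lengths below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Sketch of the SARAH recursive gradient estimator on a synthetic
# least-squares finite sum P(w) = (1/n) sum_i 0.5*(a_i.w - b_i)^2.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):            # gradient of the i-th component f_i
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):            # nabla P(w)
    return A.T @ (A @ w - b) / n

def sarah_epoch(w, eta=0.05, m=200):
    v = full_grad(w)         # v_0 = nabla P(w_0): one full pass per epoch
    w_prev, w = w, w - eta * v
    for _ in range(m):
        i = rng.integers(n)
        # recursive update: v_t = v_{t-1} + grad_i(w_t) - grad_i(w_{t-1})
        v = v + grad_i(w, i) - grad_i(w_prev, i)
        w_prev, w = w, w - eta * v
    return w

w = np.zeros(d)
for _ in range(10):
    w = sarah_epoch(w)
print(np.linalg.norm(full_grad(w)))  # gradient norm shrinks markedly
```

Note that only the most recent pair of iterates is needed to update v_t, which is the memoryless property referenced above.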

2. Adaptive and Implicit Step-Size Mechanism

Unlike SARAH, which employs a fixed step-size \eta, AI-SARAH adaptively selects step-sizes at each inner iteration. At every inner step, the algorithm defines the subproblem

\xi_t(\alpha) = \|\nabla f_{S_t}(w_{t-1} - \alpha v_{t-1}) - \nabla f_{S_t}(w_{t-1}) + v_{t-1}\|^2,

where S_t is a sampled mini-batch. The step-size \tilde\alpha_{t-1} is chosen to (approximately) minimize \xi_t(\alpha). A one-step Newton update at \alpha = 0 yields

\tilde\alpha_{t-1} = -\frac{\xi_t'(0)}{|\xi_t''(0)|},

where

\begin{align*}
\xi_t'(0) &= -2\, v_{t-1}^T \nabla^2 f_{S_t}(w_{t-1})\, v_{t-1}, \\
\xi_t''(0) &= 2\, v_{t-1}^T \left[\nabla^2 f_{S_t}(w_{t-1})\right]^2 v_{t-1} + 2\, v_{t-1}^T \nabla^3 f_{S_t}(w_{t-1})[v_{t-1}]\, v_{t-1}.
\end{align*}

To stabilize the step-size, an exponential moving average of the reciprocals 1/\tilde\alpha_{t-1} (a running harmonic mean) maintains an upper bound \alpha_{\max}; the step-size for each iteration is then set as \alpha_{t-1} = \min\{\tilde\alpha_{t-1}, \alpha_{\max}\} (Shi et al., 2021).

AI-SARAH thereby adjusts to local curvature and smoothness, estimating effective step-sizes dynamically without access to the global L or \mu. This adaptation is not present in base SARAH or other fixed step-size variance-reduced methods.
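
For a quadratic component function the third-derivative term vanishes, so the one-step Newton update above has a closed form; the following sketch evaluates it on a synthetic Hessian H and direction v (illustrative values, not the paper's data):

```python
import numpy as np

# For a quadratic f with Hessian H, nabla^3 f = 0, so the implicit step-size
# reduces to alpha = -xi'(0)/|xi''(0)| = (v^T H v)/(v^T H^2 v).
rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)        # Hessian of a strongly convex quadratic
v = rng.standard_normal(d)     # current recursive-gradient direction v_{t-1}

xi_p0 = -2.0 * v @ H @ v       # xi'(0)  = -2 v^T H v
xi_pp0 = 2.0 * v @ H @ H @ v   # xi''(0) =  2 v^T H^2 v (third-order term = 0)
alpha = -xi_p0 / abs(xi_pp0)
print(alpha)                   # lies between 1/lambda_max and 1/lambda_min of H
```

The resulting \alpha is a curvature-weighted quantity that automatically sits between the reciprocals of the extreme local eigenvalues, which is why no global L is needed.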

3. Algorithmic Structure and Stopping Criteria

AI-SARAH operates in epochs (outer loops), each with up to m inner steps, but can terminate an epoch early if \|v_t\|^2 < \gamma \|v_0\|^2 for a default threshold \gamma = 1/32. Pseudocode details include:

  • Mini-batch size b (default b = 100 or b \approx 0.1n),
  • Exponential smoothing parameter \beta (default 0.999),
  • No explicit requirement for knowledge of the global L or \mu,
  • Updates using automatic differentiation for efficient computation of \xi_t'(0) and \xi_t''(0).

The algorithm’s stopping rule for the inner loop is identical in spirit to the SARAH⁺ variant, leveraging the observed decay in the squared norm of v_t to avoid unnecessary iterations and maintain variance reduction efficiency (Nguyen et al., 2017).
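
Putting the pieces together, here is a minimal sketch of the epoch structure with the adaptive step-size, the EMA bound, and the early-stopping rule. The least-squares data, the quadratic closed-form step-size, and the initial EMA state delta are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hedged sketch of AI-SARAH-style epochs on a synthetic least-squares problem.
rng = np.random.default_rng(2)
n, d, b = 200, 8, 20
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def batch_grad(w, idx):
    return A[idx].T @ (A[idx] @ w - y[idx]) / len(idx)

def full_grad(w):
    return A.T @ (A @ w - y) / n

def ai_sarah_epoch(w, delta, gamma=1/32, beta=0.999):
    m = 2 * n // b                      # inner-loop cap m ~ 2n/b
    v = full_grad(w)                    # v_0: one full gradient per epoch
    v0_sq = v @ v
    for _ in range(m):
        idx = rng.integers(n, size=b)
        Hb = A[idx].T @ A[idx] / b      # mini-batch Hessian (quadratic loss)
        alpha_t = (v @ Hb @ v) / (v @ Hb @ Hb @ v)   # one-step Newton
        delta = beta * delta + (1 - beta) / alpha_t  # EMA of reciprocals
        alpha = min(alpha_t, 1.0 / delta)            # alpha_max = 1/delta
        w_next = w - alpha * v
        v = v + batch_grad(w_next, idx) - batch_grad(w, idx)
        w = w_next
        if v @ v < gamma * v0_sq:       # SARAH+-style early termination
            break
    return w, delta

w, delta = np.zeros(d), 2.0             # delta = 2.0 caps alpha at 0.5 initially
for _ in range(10):
    w, delta = ai_sarah_epoch(w, delta)
print(np.linalg.norm(full_grad(w)))     # decreases across epochs
```

The EMA state delta persists across epochs, so \alpha_{\max} adapts slowly while \tilde\alpha_t reacts to the local mini-batch curvature at every step.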

4. Theoretical Convergence Properties

Under strong convexity, with P being \mu-strongly convex and each f_i convex (and using L_i^t-local smoothness), AI-SARAH achieves a linearly decaying expected gradient norm per outer epoch:

\mathbb{E}[\|\nabla P(\tilde w_k)\|^2] \leq \left( \prod_{\ell=1}^k \sigma_m^\ell \right) \|\nabla P(\tilde w_0)\|^2,

where

\sigma_m^k = \frac{1}{\mu \mathcal{H}} + \frac{\eta_0 L^0}{2 - \eta_0 L^0},

and the total inner-loop step-sum is \mathcal{H} = \sum_{t=0}^m \eta_t, with \eta_t = 1/L^t. When the local smoothness L^t is much smaller than the global L, \mathcal{H} is larger, leading to potentially faster convergence than standard SARAH. In particular, the rate recovers the classical SARAH rate when L_i^t = L and improves upon it where the local geometry allows (Shi et al., 2021).

5. Computational Complexity

Each AI-SARAH inner iteration consists of:

  • One mini-batch gradient at w_{t-1} and w_t (O(bd)),
  • Two extra directional derivatives via automatic differentiation for \xi_t'(0) and \xi_t''(0) (O(bd) each).

Thus, the per-inner-iteration cost is approximately 3b equivalent stochastic gradients, compared with b for base SARAH. The outer full gradient \nabla P(w_0) is computed once per epoch (O(nd)). For \epsilon-accuracy in the strongly convex case, both AI-SARAH and classical SARAH require O((n + \kappa) \log(1/\epsilon)) gradient-equivalents, since m \approx n/b in practical parameterizations. The constant-factor overhead is offset by the improved adaptation and the absence of hyperparameter tuning (Shi et al., 2021).
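
A back-of-envelope check of this accounting, with illustrative values of n and b (not from the paper's experiments):

```python
# Per-epoch cost in stochastic-gradient equivalents (illustrative numbers).
n, b = 100_000, 100
m = n // b                        # inner steps per epoch, m ~ n/b
sarah_cost = n + b * m            # full gradient + one mini-batch gradient/step
ai_sarah_cost = n + 3 * b * m     # + two autodiff directional derivatives/step
print(sarah_cost, ai_sarah_cost)  # 200000 400000
```

With m \approx n/b, both methods cost a small constant multiple of n per epoch, so the asymptotic complexity is unchanged.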

6. Empirical Performance and Robustness

Extensive experiments on logistic regression tasks, both regularized and unregularized, were conducted on 10 LIBSVM datasets (e.g., ijcnn1, rcv1, news20, gisette, mushrooms). Competitors included fine-tuned SARAH, SARAH+, SVRG, Adam, and SGD-momentum, with up to 5,000 hyperparameter configurations evaluated per dataset for non-AI-SARAH methods.

Key comparative results:

  • AI-SARAH, with default settings (\gamma = 1/32, \beta = 0.999, b = 100), uniformly outperformed or matched the best tuned SARAH, SARAH+, and SVRG algorithms in convergence per effective pass and wall-clock time.
  • AI-SARAH was competitive or faster than Adam and SGD-momentum in reaching lower gradient norms on convex problems.
  • AI-SARAH exhibited robust performance across all problems with no per-dataset tuning.

An example comparison table for the final \|\nabla P(w)\|^2 after 20 passes (regularized setting):

Dataset | AI-SARAH | Best SARAH | Best SVRG | Best Adam | Best SGD-m
ijcnn1  | 1.2e–6   | 8.7e–6     | 1.0e–5    | 3.5e–6    | 5.1e–6
rcv1    | 2.3e–7   | 1.5e–6     | 2.1e–6    | 4.4e–7    | 8.9e–7

Values represent mean outcomes over 10 seeds, indicating that AI-SARAH converges more rapidly without requiring hyperparameter tuning (Shi et al., 2021).

7. Implementation Guidelines and Practical Considerations

Recommended implementation details for AI-SARAH include:

  • Default parameters (\gamma = 1/32, \beta = 0.999, b = 100 or 0.1n) are robust across tasks.
  • The step-size subproblem can be solved via a one-step Newton update at \alpha = 0, evaluable by two backward-mode autodiff passes; for quadratic loss, a closed form is available.
  • Exponential smoothing on 1/\tilde\alpha is used to update \alpha_{\max}; i.e., \alpha_{\max} \leftarrow 1/(\beta \delta_{\text{old}} + (1-\beta)/\tilde\alpha).
  • Stop the inner loop when \|v_t\|^2 < \gamma \|v_0\|^2 or after m \approx 2n/b steps.
  • No requirement to estimate the global L or \mu.
  • Efficient integration into deep learning frameworks (e.g., PyTorch, TensorFlow) by treating the step-size \alpha as a leaf variable in the computational graph and computing the required derivatives through backpropagation (Shi et al., 2021).
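
As a sanity check of the derivative computations, the sketch below evaluates \xi_t'(0) and \xi_t''(0) by central differences (standing in here for the two autodiff passes) on a synthetic quadratic, and compares the resulting step-size with the closed form:

```python
import numpy as np

# Numerically recover xi'(0) and xi''(0) for a quadratic mini-batch loss
# (synthetic H, w, v) and check the one-step Newton step-size against the
# closed form (v^T H v)/(v^T H^2 v) available for quadratics.
rng = np.random.default_rng(3)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)            # mini-batch Hessian of a quadratic loss
w = rng.standard_normal(d)
v = rng.standard_normal(d)

def grad(u):                       # mini-batch gradient of 0.5 * u^T H u
    return H @ u

def xi(a):                         # xi_t(a) = ||grad(w - a v) - grad(w) + v||^2
    g = grad(w - a * v) - grad(w) + v
    return g @ g

h = 1e-5
xi_p = (xi(h) - xi(-h)) / (2 * h)            # central difference for xi'(0)
xi_pp = (xi(h) - 2 * xi(0.0) + xi(-h)) / h**2  # central difference for xi''(0)
alpha = -xi_p / abs(xi_pp)
alpha_exact = (v @ H @ v) / (v @ H @ H @ v)
print(alpha, alpha_exact)          # the two agree to numerical precision
```

In a framework implementation the two finite-difference evaluations would be replaced by the backward-mode passes described above; the finite differences are only a dependency-free stand-in for this sketch.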

8. Relation to SARAH, SARAH⁺, and iSARAH

AI-SARAH shares the stochastic recursive gradient paradigm with SARAH, which achieves linear convergence under strong convexity and offers the unique property that the inner-loop estimator v_t converges linearly in expectation within a single loop (Nguyen et al., 2017). SARAH⁺ introduces adaptive inner-loop termination based on the decay of the v_t norm, a design inherited by AI-SARAH. Inexact SARAH (iSARAH) generalizes the methodology to expectation-minimization problems beyond finite sums by using stochastic, rather than exact, full gradients, and adjusts batch sizes per the theoretical analysis (Nguyen et al., 2018).

The defining advancement in AI-SARAH lies in automating step-size adjustment guided by local geometry—rendering it entirely tune-free in practice—while preserving the memory efficiency and convergence guarantees characteristic of the SARAH lineage.
