Variance-Reduced SGHMC
- Variance-reduced SGHMC is a class of methods that integrates stochastic gradients with techniques like SVRG and SAGA to lower variance in Bayesian sampling.
- These methods improve convergence rates and offer tighter theoretical guarantees by reducing the variance inherent in minibatch approximations.
- Empirical studies demonstrate faster convergence and robust performance in tasks such as regression and neural network inference compared to standard SGHMC.
Variance-reduced Stochastic Gradient Hamiltonian Monte Carlo (VR-SGHMC) refers to a class of methods that enhance the efficiency and scalability of Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) by incorporating variance reduction schemes, primarily those developed within the stochastic optimization literature. These algorithms enable efficient Bayesian inference for large datasets by addressing the inherent variance in stochastic gradient approximations, thereby improving both theoretical convergence guarantees and empirical performance in high-dimensional or nonconvex Bayesian models (Zou et al., 2018, Li et al., 2018, Chen et al., 2017, Hu et al., 2021).
1. Theoretical Background and Motivation
SGHMC algorithms are discretizations of underdamped Langevin dynamics or Hamiltonian systems used for posterior sampling. Writing the negative log-posterior over $n$ observations as $U(\theta) = \sum_{i=1}^{n} U_i(\theta)$, conventional HMC employs the exact gradient $\nabla U(\theta)$, which becomes infeasible for large-scale problems. SGHMC replaces the full gradient with a stochastic estimator $\tilde{\nabla} U(\theta)$, generally computed over minibatches. However, naive stochastic gradients exhibit high variance, resulting in slow mixing, poor approximation to the posterior, and degraded empirical risk.
Variance-reduction methods, such as SVRG and SAGA, originally developed for stochastic optimization, construct control variates by leveraging either a full-gradient snapshot or a table of component-wise gradients. In the SGHMC context, these methods produce unbiased or biased gradient estimates with guaranteed lower variance, directly improving the quality of the Markov chain both in theory and practice (Zou et al., 2018, Li et al., 2018, Chen et al., 2017, Hu et al., 2021).
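To make the variance-reduction idea concrete, the following sketch compares the variance of a naive minibatch gradient against an SVRG-style control-variate estimate on a toy problem with quadratic components $U_i(\theta) = \tfrac{1}{2}\lVert\theta - x_i\rVert^2$ (the toy model, batch sizes, and variable names are illustrative, not from the cited papers; for this quadratic, the control variate happens to cancel the minibatch noise exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 1000, 5, 10
x = rng.normal(size=(n, d))            # data; component U_i(theta) = 0.5*||theta - x_i||^2

def grad_i(theta, idx):                # per-example gradients, shape (len(idx), d)
    return theta - x[idx]

def full_grad(theta):                  # exact gradient of U = sum_i U_i
    return n * theta - x.sum(axis=0)

theta = rng.normal(size=d)             # current iterate
snapshot = theta + 0.01 * rng.normal(size=d)   # nearby SVRG snapshot point

naive, svrg = [], []
for _ in range(2000):
    idx = rng.choice(n, size=b, replace=False)
    g_naive = (n / b) * grad_i(theta, idx).sum(axis=0)
    # SVRG control variate: minibatch difference plus full gradient at the snapshot
    g_svrg = (n / b) * (grad_i(theta, idx) - grad_i(snapshot, idx)).sum(axis=0) \
             + full_grad(snapshot)
    naive.append(g_naive)
    svrg.append(g_svrg)

var_naive = np.var(naive, axis=0).sum()
var_svrg = np.var(svrg, axis=0).sum()
# both estimators are unbiased for full_grad(theta); only their variances differ
```

For this quadratic the per-example gradient differences are constant across examples, so the SVRG estimate is noiseless; in general the variance shrinks with the distance between the iterate and the snapshot.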
2. Variance-Reduction Schemes in SGHMC
Variance-reduced SGHMC methods typically employ one of four estimators:
| Estimator | Bias | Mechanism |
|---|---|---|
| SVRG | Unbiased | Full-gradient snapshot, epoch-wise |
| SAGA | Unbiased | Per-component table, per-iteration |
| SARAH | Biased | Recursive single-pass estimator |
| SARGE | Biased | Per-component recursive memory |
- SVRG: Maintains a reference (snapshot) point $\tilde{\theta}$ at which the full gradient $\nabla U(\tilde{\theta})$ is computed. Within each epoch, stochastic gradients are formed as
  $$\tilde{\nabla} U(\theta_k) = \frac{n}{b} \sum_{i \in I_k} \big( \nabla U_i(\theta_k) - \nabla U_i(\tilde{\theta}) \big) + \nabla U(\tilde{\theta}),$$
  where $I_k$ is a minibatch of size $b$ (Hu et al., 2021, Li et al., 2018, Zou et al., 2018).
- SAGA: Maintains a table $\{\alpha_i\}_{i=1}^{n}$ of the most recent component gradients and, at each step, updates the entries for a random batch $I_k$ while forming
  $$\tilde{\nabla} U(\theta_k) = \frac{n}{b} \sum_{i \in I_k} \big( \nabla U_i(\theta_k) - \alpha_i \big) + \sum_{j=1}^{n} \alpha_j$$
  (Hu et al., 2021, Li et al., 2018).
- SARAH/SARGE: Deploy recursive estimators with single-pass or table-based memory. They are biased in general but can exhibit smaller mean-squared-error (MSE) in certain regimes (Hu et al., 2021).
These estimators are seamlessly embedded in the SGHMC integrator, replacing the naive minibatch gradient.
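As a concrete sketch of the table-based scheme, a SAGA-style estimator can be implemented as a small class that stores per-component gradients and a running sum (shown for toy quadratic components $U_i(\theta) = \tfrac{1}{2}\lVert\theta - x_i\rVert^2$; the class name and data model are illustrative assumptions, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b = 500, 3, 5
x = rng.normal(size=(n, d))                 # components: U_i(theta) = 0.5*||theta - x_i||^2

class SAGAGradient:
    """SAGA estimator for grad U(theta), U = sum_i U_i, with a per-component table."""
    def __init__(self, theta0):
        self.alpha = theta0 - x                  # stored component gradients, shape (n, d)
        self.alpha_sum = self.alpha.sum(axis=0)  # running sum of all table entries
    def __call__(self, theta):
        idx = rng.choice(n, size=b, replace=False)
        g_new = theta - x[idx]                   # fresh gradients for the sampled batch
        diff = (g_new - self.alpha[idx]).sum(axis=0)
        g = (n / b) * diff + self.alpha_sum      # unbiased estimate of grad U(theta)
        self.alpha_sum += diff                   # keep the running sum consistent
        self.alpha[idx] = g_new                  # overwrite the table for this batch
        return g

theta0 = rng.normal(size=d)
est = SAGAGradient(theta0)
g0 = est(theta0)   # table was built at theta0, so the first call returns the exact full gradient
```

Maintaining `alpha_sum` incrementally keeps each call at $O(b d)$ cost, at the price of the $O(n d)$ memory for the table noted in the text.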
3. Algorithmic Frameworks
The core dynamics for variance-reduced SGHMC are based on the discretized underdamped Langevin SDE:
$$d\theta_t = v_t\,dt, \qquad dv_t = -\gamma v_t\,dt - \nabla U(\theta_t)\,dt + \sqrt{2\gamma}\,dW_t,$$
where $\gamma$ is the friction parameter and $W_t$ is a $d$-dimensional Wiener process. Naive Euler discretization with a variance-reduced gradient $\tilde{\nabla} U(\theta_k)$ at each step leads to updates of the form
$$\theta_{k+1} = \theta_k + \eta v_k, \qquad v_{k+1} = (1 - \eta\gamma)\, v_k - \eta\, \tilde{\nabla} U(\theta_k) + \sqrt{2\gamma\eta}\,\xi_k,$$
where $\eta$ is the step size and $\xi_k$ is a standard normal $d$-vector.
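The Euler update above can be sketched as a single step function that accepts any gradient estimator, naive or variance-reduced (a minimal sketch; the function name and parameter values are illustrative assumptions):

```python
import numpy as np

def sghmc_euler_step(theta, v, grad_est, eta, gamma, rng):
    """One naive-Euler step of underdamped Langevin with a plug-in gradient estimate."""
    g = grad_est(theta)                          # variance-reduced or naive minibatch gradient
    theta_new = theta + eta * v                  # position update
    noise = np.sqrt(2.0 * gamma * eta) * rng.normal(size=v.shape)
    v_new = (1.0 - eta * gamma) * v - eta * g + noise   # friction, gradient kick, injected noise
    return theta_new, v_new

rng = np.random.default_rng(0)
theta, v = np.zeros(3), np.ones(3)
theta, v = sghmc_euler_step(theta, v, grad_est=lambda t: t, eta=1e-2, gamma=1.0, rng=rng)
```

Passing an SVRG- or SAGA-style estimator as `grad_est` yields the corresponding VR-SGHMC variant without changing the integrator.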
Higher-order splitting schemes, notably symmetric splitting, further decrease discretization bias, as established in (Li et al., 2018). In symmetric splitting, each iteration is decomposed into a half position step, an exactly solved Ornstein–Uhlenbeck half-step for the friction and noise, a full gradient kick, a second OU half-step, and a final half position step:
$$\theta \leftarrow \theta + \tfrac{\eta}{2} v, \quad v \leftarrow e^{-\gamma\eta/2} v + \sqrt{1 - e^{-\gamma\eta}}\,\xi_1, \quad v \leftarrow v - \eta\, \tilde{\nabla} U(\theta), \quad v \leftarrow e^{-\gamma\eta/2} v + \sqrt{1 - e^{-\gamma\eta}}\,\xi_2, \quad \theta \leftarrow \theta + \tfrac{\eta}{2} v,$$
which improves the discretization bias from $O(\eta)$ to $O(\eta^2)$ in the MSE bound (Li et al., 2018).
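One common symmetric splitting of this kind can be sketched as follows (a sketch in the spirit of Li et al., 2018, assuming unit-variance momentum so the friction-plus-noise Ornstein–Uhlenbeck sub-steps solve exactly; the exact scheme in the paper may differ in details):

```python
import numpy as np

def sghmc_splitting_step(theta, v, grad_est, eta, gamma, rng):
    """Symmetric splitting: half position step, exact OU half-step (friction + noise),
    full gradient kick, second OU half-step, final half position step."""
    c = np.exp(-gamma * eta / 2.0)
    s = np.sqrt(1.0 - c * c)                  # exact OU noise scale over a half step
    theta = theta + 0.5 * eta * v
    v = c * v + s * rng.normal(size=v.shape)
    v = v - eta * grad_est(theta)             # only this sub-step carries gradient error
    v = c * v + s * rng.normal(size=v.shape)
    theta = theta + 0.5 * eta * v
    return theta, v

rng = np.random.default_rng(0)
theta, v = np.zeros(2), np.ones(2)
theta, v = sghmc_splitting_step(theta, v, lambda t: t, eta=1e-2, gamma=1.0, rng=rng)
```

Because the friction and noise parts are integrated exactly, only the gradient kick contributes discretization error, which is what buys the improved bias order.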
4. Convergence Theory and Gradient Complexity
Central to the analysis of VR-SGHMC are precise non-asymptotic bounds on mean-square error (MSE) and 2-Wasserstein distance to the target posterior. The bias and variance terms are governed by the properties of the gradient estimator—particularly through the Mean-Squared-Error-Bias (MSEB) property (Hu et al., 2021).
- Unbiased estimators (SVRG, SAGA): For strongly log-concave posteriors, reach $\epsilon$-accuracy in 2-Wasserstein distance with a gradient complexity that substantially improves over standard SGHMC and full-gradient HMC in practical regimes (Zou et al., 2018, Hu et al., 2021).
- Biased estimators (SARAH, SARGE): Reduce the $n$-dependence in gradient complexity, but with a weaker $\epsilon$-dependency (Hu et al., 2021).
The table summarizes regime differences:
| Method | Gradient Complexity | Bias | Best Regime |
|---|---|---|---|
| SVRG-HMC, SAGA | Better $\epsilon$-dependence | Unbiased | High-precision, moderate $n$ |
| SARAH, SARGE | Better $n$-dependence | Biased | Moderate-accuracy, large $n$ |
The bounds depend explicitly on the smoothness constant $L$, the strong convexity parameter $\mu$ (condition number $\kappa = L/\mu$), the minibatch size $b$, and tuning parameters such as the snapshot interval $\tau$ (Zou et al., 2018, Hu et al., 2021).
5. Hyperparameter Selection and Practical Aspects
Empirical and theoretical recommendations for hyperparameters are as follows:
- Minibatch size ($b$): Modest values are typical for SVRG/SAGA schemes (Li et al., 2018).
- Snapshot interval ($\tau$): For SVRG, the snapshot is refreshed on the order of once per data pass; Chen et al. (2017) update two-batch control variates periodically.
- Step size ($\eta$): Symmetric splitting allows larger $\eta$; the stability condition of the analysis must still be satisfied (Li et al., 2018).
- Friction ($\gamma$): Larger $\gamma$ leads to more rapid velocity dissipation.
- Control-variate and online batch sizes: Batch sizes up to roughly $100$ are reported, with the control variate updated every $\tau$ iterations (Chen et al., 2017).
A pragmatic implementation simply replaces the SGHMC minibatch gradient with the control-variate estimate; for SAGA-type methods, memory and computation additionally scale with the size of the auxiliary gradient table.
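Putting the pieces together, the following is a minimal SVRG-SGHMC loop on a toy Gaussian target with components $U_i(\theta) = \tfrac{1}{2}\lVert\theta - x_i\rVert^2$, whose posterior is $\mathcal{N}(\bar{x}, I/n)$. All hyperparameter values are illustrative, not recommendations from the cited papers, and on this quadratic the SVRG control variate is in fact exact:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, b, tau = 200, 2, 10, 20          # data size, dimension, minibatch, snapshot interval
eta, gamma = 1e-3, 5.0                 # step size and friction (illustrative values)
x = rng.normal(size=(n, d))            # U(theta) = sum_i 0.5*||theta - x_i||^2

theta, v = np.zeros(d), np.zeros(d)
snapshot, g_snap = theta.copy(), n * theta - x.sum(axis=0)

samples = []
for k in range(20000):
    if k % tau == 0:                   # refresh SVRG snapshot with a full gradient pass
        snapshot, g_snap = theta.copy(), n * theta - x.sum(axis=0)
    idx = rng.choice(n, size=b, replace=False)
    # SVRG gradient estimate: minibatch difference plus snapshot full gradient
    g = (n / b) * ((theta - x[idx]) - (snapshot - x[idx])).sum(axis=0) + g_snap
    # naive-Euler SGHMC update
    theta = theta + eta * v
    v = (1.0 - eta * gamma) * v - eta * g + np.sqrt(2.0 * gamma * eta) * rng.normal(size=d)
    if k > 5000:                       # discard burn-in
        samples.append(theta.copy())

post_mean = x.mean(axis=0)             # exact posterior mean for this toy target
err = np.linalg.norm(np.mean(samples, axis=0) - post_mean)
```

For a non-quadratic posterior, only the per-component gradient lines change; the snapshot logic and integrator are unaffected.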
6. Empirical Performance and Applications
Experimental results consistently show accelerated convergence and reduced estimator variance for VR-SGHMC methods compared to standard SGHMC and Langevin approaches. In large-scale Bayesian regression, classification, and neural network inference tasks, SVRG2nd-HMC and its SAGA/HMC variants exhibit the fastest convergence in both training and test metrics. Key metrics include test mean-squared error (MSE), test negative log-likelihood, and root-MSE for Bayesian neural networks.
- On the UCI “concrete” dataset, after 5 data-passes, SVRG-HMC and SVRG2nd-HMC attain lower test MSE than standard SGHMC, with SVRG2nd-HMC performing best (Li et al., 2018).
- On the “protein” dataset (BNN, after 2 passes), SVRG2nd-HMC attains a lower test RMSE than SGHMC (Li et al., 2018).
Consistent speedups in convergence are observed across Bayesian regression, classification, and deep neural networks (MLP, CNN, ResNet) (Chen et al., 2017, Li et al., 2018). Variance reduction also yields smoother learning curves and more robust out-of-sample performance.
7. Comparison and Extensions
Variance-reduced SGHMC variants outperform traditional stochastic gradient MCMC methods (SGHMC, SGLD, VR-SGLD) across a wide range of regimes in both theory and practice.
- For strongly convex log-posteriors, VR-SGHMC achieves a mixed-regime complexity strictly better than standard HMC or SGHMC except in rare limiting regimes (Zou et al., 2018).
- Higher-order symmetric splitting can further accelerate mixing and reduce the discretization error to $O(\eta^2)$ (Li et al., 2018).
- For general (non-strongly convex) log-concave targets, extensions add a quadratic regularizer and achieve comparable bounds modulo regularization-dependent terms (Zou et al., 2018).
Unbiased (SVRG/SAGA) and biased (SARAH/SARGE) VR-SGHMC schemes entail a trade-off between asymptotic bias and mean-square error; unbiased methods excel in high-precision regimes, while biased estimators are more attractive when moderate precision or very large data is paramount (Hu et al., 2021).
References
- "Stochastic Variance-Reduced Hamilton Monte Carlo Methods" (Zou et al., 2018)
- "Stochastic Gradient Hamiltonian Monte Carlo with Variance Reduction for Bayesian Inference" (Li et al., 2018)
- "A Convergence Analysis for A Class of Practical Variance-Reduction Stochastic Gradient MCMC" (Chen et al., 2017)
- "A New Framework for Variance-Reduced Hamiltonian Monte Carlo" (Hu et al., 2021)