Stochastic Natural-Gradient SSVI
- The paper introduces a scalable Bayesian inference technique that merges natural gradient optimization with stochastic variational inference, achieving provable convergence in conjugate models.
- It employs gradient smoothing and adaptive step sizes to reduce variance and enhance convergence rates in large-scale, nonconjugate settings.
- Practical implementations leverage Cholesky parameterizations and block sparsity to ensure stability, efficiency, and adherence to statistical constraints.
Stochastic Natural-Gradient SSVI is a scalable Bayesian inference methodology that integrates natural-gradient optimization with stochastic variational inference (SVI), offering provable convergence guarantees in conjugate models and substantial empirical speed-ups in nonconjugate or large-scale settings. The core contributions and theoretical foundations are delineated below.
1. ELBO Optimization and Natural Gradient Foundations
The objective in variational inference is to maximize the evidence lower bound (ELBO)
$$\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}\big[\log p(x, z) - \log q_\lambda(z)\big],$$
where $q_\lambda$ is a variational family over latent variables $z$, parameterized by $\lambda$. In exponential family models, $q_\lambda$ defines a Riemannian geometry via its Fisher information matrix,
$$F(\lambda) = \mathbb{E}_{q_\lambda}\big[\nabla_\lambda \log q_\lambda(z)\,\nabla_\lambda \log q_\lambda(z)^\top\big].$$
The natural gradient preconditions the standard gradient by the inverse Fisher matrix, $\widetilde{\nabla}_\lambda \mathcal{L} = F(\lambda)^{-1}\nabla_\lambda \mathcal{L}$, yielding steepest ascent in the Fisher information geometry:
$$\lambda_{t+1} = \lambda_t + \rho_t\, F(\lambda_t)^{-1}\nabla_\lambda \mathcal{L}(\lambda_t).$$
This direction adapts dynamically to the local curvature of the statistical model, improving convergence rates and stability, especially in high-dimensional settings (Wu et al., 2024, Mandt et al., 2014, Tan, 2021).
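As a toy illustration of Fisher preconditioning (an assumption-laden sketch, not code from the cited papers), the following maximizes a simple objective over a one-dimensional Bernoulli family, where the Fisher information is the scalar $A''(\lambda) = \mu(1-\mu)$:

```python
import numpy as np

# Toy sketch (illustrative target and step size, not from the cited papers):
# natural-gradient ascent for a 1-D Bernoulli exponential family with
# natural parameter lam and log-partition A(lam) = log(1 + exp(lam)).
# The Fisher information is the scalar F(lam) = A''(lam) = mu * (1 - mu).
# We maximize the toy objective L(mu) = -(mu - 0.7)**2 over the mean mu.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def natural_grad_step(lam, lr=0.5, target=0.7):
    mu = sigmoid(lam)                     # mean parameter mu = A'(lam)
    fisher = mu * (1.0 - mu)              # F(lam) = A''(lam)
    grad = -2.0 * (mu - target) * fisher  # chain rule: dL/dlam = (dL/dmu) * F
    return lam + lr * grad / fisher       # precondition by F^{-1}

lam = 0.0
for _ in range(200):
    lam = natural_grad_step(lam)
print(sigmoid(lam))  # converges to the target mean 0.7
```

Note how the $F^{-1}$ factor cancels the Jacobian $d\mu/d\lambda$, so the natural-gradient step in $\lambda$ is an ordinary gradient step in the mean parameter $\mu$: this is the curvature adaptation described above in its simplest form.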
2. Stochastic Natural-Gradient Update and Smoothing
For large datasets or intractable expectations in the ELBO, stochastic estimation is used: a minibatch yields $\hat{\lambda}_t = \alpha + N\,\hat{s}_t$, where $\hat{s}_t$ is an unbiased estimator of the per-datapoint expected sufficient statistics, and $\alpha$ is the prior's natural parameter. The update becomes
$$\lambda_{t+1} = (1 - \rho_t)\,\lambda_t + \rho_t\,\hat{\lambda}_t.$$
Smoothed SSVI, as proposed in (Mandt et al., 2014), replaces the instantaneous stochastic estimate with a moving average over the past $L$ minibatches, yielding a variance reduction of approximately $1/L$ but introducing bias as $\lambda_t$ drifts from previous values. The mean squared error of the estimator decomposes into variance and squared-bias terms, shaping the classical bias-variance tradeoff (Mandt et al., 2014).
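The smoothing scheme can be sketched as follows; the target vector, noise scale, and window length $L$ are illustrative assumptions, with each noisy draw standing in for a minibatch estimate of the optimal natural parameter:

```python
from collections import deque
import numpy as np

# Minimal sketch of smoothed SSVI-style averaging (target, noise scale,
# and window length are illustrative, not values from Mandt et al., 2014).
# Each "minibatch" yields an unbiased but noisy estimate of the optimal
# natural parameter; we average the last L estimates before stepping.

rng = np.random.default_rng(0)
target = np.array([2.0, -1.0])       # hypothetical optimal natural parameter
lam = np.zeros(2)
window = deque(maxlen=10)            # keep the last L = 10 estimates

for t in range(1, 2001):
    noisy_estimate = target + rng.normal(scale=2.0, size=2)  # unbiased, high variance
    window.append(noisy_estimate)
    smoothed = np.mean(window, axis=0)   # variance reduced roughly by 1/L
    rho = 1.0 / t                        # diminishing step size
    lam = (1.0 - rho) * lam + rho * smoothed

print(lam)  # close to target despite the noisy per-step estimates
```

The `deque(maxlen=L)` makes the moving average an $O(L)$-memory, $O(1)$-per-step operation, which is why smoothing is considered computationally light.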
3. Non-Asymptotic Convergence Rates and Geometry
In conjugate exponential-family models, the mirror-descent update admits a non-asymptotic $O(1/T)$ convergence rate with a diminishing step schedule $\rho_t \propto 1/t$, provided the (negative) ELBO is 1-smooth and 1-strongly convex relative to the Bregman divergence induced by the dual log-partition function $A^*$. The error after $T$ steps is bounded, up to constants, by
$$\mathbb{E}\big[\mathcal{L}(\lambda^*) - \mathcal{L}(\bar{\lambda}_T)\big] \;\le\; \frac{\sigma^2}{T},$$
where $\sigma^2$ bounds the “mirror” variance and $\bar{\lambda}_T$ is the weighted iterate average (Wu et al., 2024). This provides the first non-asymptotic rate for stochastic NGVI in the conjugate setting.
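For reference, the relative smoothness and strong convexity conditions invoked here are the standard mirror-descent notions, stated for a reference function $h$ (here $h = A^*$, applied to the negative ELBO):

```latex
% Bregman divergence of the reference function h:
D_h(y, x) = h(y) - h(x) - \langle \nabla h(x),\, y - x \rangle
% f is L-smooth relative to h:
f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + L\, D_h(y, x)
% f is mu-strongly convex relative to h:
f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \mu\, D_h(y, x)
```

With $L = \mu = 1$, these are the conditions under which the non-asymptotic rate above holds.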
For nonconjugate likelihoods (e.g., logistic regression, Poisson regression), the ELBO in expectation-parameter space can exhibit nonconvexity, invalidating the mirror-descent guarantees: the objective may lack global relative strong convexity, and negative eigenvalues can arise in its Hessian, precluding standard guarantees absent additional global geometric conditions (Wu et al., 2024). A plausible implication is that for nonconjugate models, only local convergence or heuristic progress is ensured.
4. Implementation in Exponential and Gaussian Families
Practical SSVI implementations avoid computing the inverse Fisher matrix $F(\lambda)^{-1}$ explicitly:
- Expectation Parameterization: For exponential families,
$$\mu = \nabla A(\lambda) = \mathbb{E}_{q_\lambda}[t(z)],$$
where $\mu$ denotes the expectation parameters. The natural-gradient update reduces to mirror descent in $\mu$-space; map forward to $\lambda$ via the dual gradient $\lambda = \nabla A^*(\mu)$ (Wu et al., 2024).
- Cholesky Factor for Gaussian Approximation: Parameterizing the Gaussian covariance (or precision) via its Cholesky factor $C$, with $\Sigma = CC^\top$, guarantees SPD updates:
$$C_{t+1} = C_t + \rho_t\, \hat{g}_t,$$
where $\hat{g}_t$ is a stochastic estimate of the ELBO gradient w.r.t. $C$ (Tan, 2021). SPD structure and sparsity are automatically preserved.
- Block and Sparse Structure: Natural-gradient updates decompose naturally in block-diagonal or sparsity-constrained Cholesky parametrizations, maintaining statistical efficiency and computational tractability (Tan, 2021).
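The SPD-by-construction property of the Cholesky parameterization can be demonstrated with a small sketch; the "ELBO gradient" below is a toy stand-in (pulling $\Sigma$ toward a fixed SPD target, plus injected noise), not Tan's (2021) analytic update:

```python
import numpy as np

# Hedged sketch of the Cholesky idea: update the factor C of
# Sigma = C @ C.T directly, so Sigma stays SPD by construction.  The
# gradient here is a toy surrogate (pull Sigma toward a fixed SPD
# target) with injected noise, not Tan's (2021) analytic update.

rng = np.random.default_rng(1)
target = np.array([[2.0, 0.5],
                   [0.5, 1.0]])          # hypothetical target covariance
C = np.eye(2)                             # initial Cholesky factor

for _ in range(500):
    sigma = C @ C.T
    # gradient of -0.5 * ||Sigma - target||_F^2 with respect to C
    grad = -2.0 * (sigma - target) @ C
    grad += rng.normal(scale=0.05, size=grad.shape)  # stochastic noise
    C = C + 0.05 * grad                   # unconstrained step in C...

sigma = C @ C.T                           # ...yet Sigma is SPD throughout
print(np.linalg.eigvalsh(sigma))
```

Because $CC^\top$ is positive semidefinite for any $C$, no projection step is needed even though the update on $C$ itself is unconstrained.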
5. Smoothing, Momentum, and Practical Considerations
- Gradient Smoothing: Moving-average smoothing of sufficient statistics or gradients is computationally light (requiring only $O(L)$ additional memory), introduces a tunable bias-variance compromise, and empirically accelerates convergence by attenuating stochastic fluctuations (Mandt et al., 2014).
- Step Size Schedules: In conjugate models, diminishing schedules ($\rho_t \propto 1/t$) are crucial for provable rates. For nonconjugate settings, small fixed or gently decaying step sizes are often effective in practice (Wu et al., 2024).
- Momentum and Normalization: Stochastic normalized natural-gradient ascent with momentum (Snngm) outperforms Adam-style Euclidean methods in speed and quality of variational approximations. Empirically, momentum coefficients in the range $0.9$–$0.99$ are robust, with the step size scaled to match the momentum (Tan, 2021).
- Gradient Estimation: For Gaussians, Bonnet–Price gradient estimators, which use the identity $\nabla_\Sigma \mathbb{E}_q[f] = \tfrac{1}{2}\mathbb{E}_q[\nabla_z^2 f]$, guarantee that the covariance gradient remains negative definite in log-concave likelihood models, supporting domain-respecting updates (Wu et al., 2024).
- Parameter Validity: Maintaining membership in the feasible domain (e.g., positive definite covariances) may require projection or constraint via SVD or Cholesky clamping, especially outside exponential family settings (Wu et al., 2024, Tan, 2021).
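A minimal sketch of the parameter-validity point, enforcing SPD by eigenvalue clamping (the floor `eps` is an illustrative choice):

```python
import numpy as np

# Hedged sketch of validity enforcement by eigenvalue clamping (eps is
# an illustrative choice).  If an update pushes a covariance estimate
# outside the SPD cone, project it back by symmetrizing and clamping
# the spectrum from below.

def project_spd(sigma, eps=1e-6):
    sigma = 0.5 * (sigma + sigma.T)      # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(sigma)
    vals = np.clip(vals, eps, None)      # clamp eigenvalues from below
    return (vecs * vals) @ vecs.T        # reassemble: V diag(vals) V^T

bad = np.array([[1.0, 2.0],
                [2.0, 1.0]])             # eigenvalues 3 and -1: not SPD
fixed = project_spd(bad)
print(np.linalg.eigvalsh(fixed))         # smallest eigenvalue is now eps
```

This is the projection route; the Cholesky parameterization above avoids the need for it entirely by never leaving the SPD cone.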
6. Algorithmic Description
The generic Stochastic Natural-Gradient SSVI algorithm proceeds as follows (Wu et al., 2024, Mandt et al., 2014, Tan, 2021):
- Initialize $\lambda_0$ (or expectation parameter $\mu_0 = \nabla A(\lambda_0)$).
- For $t = 1, \dots, T$:
- 1. Sample a minibatch and compute the stochastic estimate $\hat{\lambda}_t$ (or $\hat{\mu}_t$).
- 2. Update $\lambda_{t+1} = (1 - \rho_t)\lambda_t + \rho_t \hat{\lambda}_t$, or
- 3. $\mu_{t+1} = (1 - \rho_t)\mu_t + \rho_t \hat{\mu}_t$, then $\lambda_{t+1} = \nabla A^*(\mu_{t+1})$.
- 4. Enforce validity of updated parameters (e.g., SPD for covariance).
- 5. (Optional) Replace the stochastic estimate with its running average over a window of $L$ minibatches for SSVI (Mandt et al., 2014).
- In Gaussian families, update the Cholesky factor $C$ and covariance $\Sigma = CC^\top$ as above, directly ensuring SPD and incorporating momentum for further acceleration (Tan, 2021).
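The loop above can be sketched end to end for a conjugate model where the exact posterior is available as a check; the data size, batch size, and step schedule are illustrative choices:

```python
import numpy as np

# Hedged end-to-end sketch of the generic loop for a conjugate model:
# inferring the mean theta of a N(theta, 1) likelihood under a N(0, 1)
# prior.  The exact posterior is closed form, so the stochastic
# natural-gradient iterates can be checked against it.  In this
# conjugate case the natural-gradient step is the classic SVI update
#   lambda <- (1 - rho) * lambda + rho * (prior + N * minibatch statistic).

rng = np.random.default_rng(2)
N, B = 1000, 50
data = rng.normal(loc=1.5, scale=1.0, size=N)

# Gaussian natural parameters: (mean/variance, -1/(2 * variance)).
prior_nat = np.array([0.0, -0.5])        # N(0, 1) prior
lam = prior_nat.copy()

for t in range(1, 3001):
    batch = rng.choice(data, size=B, replace=False)
    # Unbiased estimate of the full-data natural-parameter contribution:
    # each observation x_i contributes (x_i, -1/2).
    stat = np.array([N * batch.mean(), -0.5 * N])
    rho = 1.0 / t
    lam = (1.0 - rho) * lam + rho * (prior_nat + stat)

post_var = -0.5 / lam[1]
post_mean = lam[0] * post_var
exact_var = 1.0 / (1.0 + N)
exact_mean = data.sum() / (1.0 + N)
print(post_mean, exact_mean)  # the two agree closely
```

With the $\rho_t = 1/t$ schedule, the iterate is exactly the running average of the per-minibatch natural-parameter estimates, which is why it converges to the exact conjugate posterior.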
7. Empirical Results and Theoretical Significance
Experiments on Latent Dirichlet Allocation, generalized linear mixed models, and deep neural networks indicate that:
- SSVI with window sizes up to $L = 100$ substantially reduces variance and achieves faster increases in predictive likelihood than unsmoothed SVI (Mandt et al., 2014).
- Snngm and momentum-based natural-gradient variants converge in orders of magnitude fewer epochs than Euclidean optimizers such as Adam, often producing superior ELBO values (Tan, 2021).
- Block or sparse precision structures extend natural-gradient updates to highly structured posteriors at minimal computational overhead (Tan, 2021).
- The proven $O(1/T)$ rate in conjugate models establishes SSVI’s theoretical soundness, yet for nonconjugate models, global convergence is generally unguaranteed without further geometric assumptions (Wu et al., 2024).
In summary, Stochastic Natural-Gradient SSVI unites the geometric adaptivity of natural-gradient optimization with the scalability of SVI. It achieves superior convergence rates and enables robust, scalable Bayesian inference under exponential family models, with implementation flexibility for smoothing, Cholesky parametrizations, block sparsity, and momentum (Wu et al., 2024, Mandt et al., 2014, Tan, 2021).