
Stochastic Natural-Gradient SSVI

Updated 22 February 2026
  • The paper introduces a scalable Bayesian inference technique that merges natural gradient optimization with stochastic variational inference, achieving provable convergence in conjugate models.
  • It employs gradient smoothing and adaptive step sizes to reduce variance and enhance convergence rates in large-scale, nonconjugate settings.
  • Practical implementations leverage Cholesky parameterizations and block sparsity to ensure stability, efficiency, and adherence to statistical constraints.

Stochastic Natural-Gradient SSVI is a scalable Bayesian inference methodology that integrates natural-gradient optimization with stochastic variational inference (SVI), offering provable convergence guarantees in conjugate models and substantial empirical speed-ups in nonconjugate or large-scale settings. The core contributions and theoretical foundations are delineated below.

1. ELBO Optimization and Natural Gradient Foundations

The objective in variational inference is to maximize the evidence lower bound (ELBO):

$$\mathcal{L}(\lambda) = \mathbb{E}_{q(z;\lambda)}[\log p(y, z)] - \mathbb{E}_{q(z;\lambda)}[\log q(z;\lambda)],$$

where $q(z;\lambda)$ is a variational family over latent variables $z$, parameterized by $\lambda$. In exponential-family models, $q(\cdot;\lambda)$ defines a Riemannian geometry via its Fisher information matrix,

$$F(\lambda) = \mathbb{E}_{q(z;\lambda)}\big[\nabla_\lambda \log q(z;\lambda)\,\nabla_\lambda \log q(z;\lambda)^\top\big] = -\mathbb{E}_q\big[\nabla^2_{\lambda} \log q\big].$$

The natural gradient preconditions the standard gradient by $F(\lambda)^{-1}$, yielding steepest ascent in the Fisher information geometry:

$$\widetilde{\nabla}_\lambda\mathcal{L}(\lambda) = F(\lambda)^{-1} \nabla_\lambda \mathcal{L}(\lambda).$$

This direction adapts dynamically to the local curvature of the statistical model, improving convergence rates and stability, especially in high-dimensional settings (Wu et al., 2024, Mandt et al., 2014, Tan, 2021).
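The preconditioning can be made concrete with a one-parameter toy example. The sketch below (illustrative model, target, and step size; not drawn from the cited papers) applies natural-gradient ascent to a single Bernoulli factor, whose Fisher information is the scalar $\mu(1-\mu)$:

```python
import numpy as np

# Toy sketch: natural gradient for a Bernoulli variational factor
# q(z; lam) with natural parameter lam (the logit). The log-partition is
# A(lam) = log(1 + exp(lam)), so the Fisher information is the scalar
# F(lam) = A''(lam) = mu * (1 - mu), with mu = sigmoid(lam).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def natural_gradient(grad_lam, lam):
    """Precondition a Euclidean gradient in lam by F(lam)^{-1}."""
    mu = sigmoid(lam)
    fisher = mu * (1.0 - mu)            # Fisher information (scalar here)
    return grad_lam / fisher

# Illustrative objective: maximize L(lam) = mu* lam - A(lam) for a target
# mean mu* = 0.8; its Euclidean gradient in lam is mu* - sigmoid(lam).
target = 0.8
lam = 0.0
for _ in range(50):
    grad = target - sigmoid(lam)                # Euclidean gradient
    lam += 0.5 * natural_gradient(grad, lam)    # natural-gradient ascent
# sigmoid(lam) is now very close to the target mean
```

Dividing by the Fisher information rescales the logit gradient so that each step makes near-uniform progress in the mean parameter, which is exactly the curvature adaptation described above.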

2. Stochastic Natural-Gradient Update and Smoothing

For large datasets, or when expectations in the ELBO are intractable, stochastic estimation is used:

$$\widehat{\nabla}_\lambda\mathcal{L}(\lambda_t) = \widehat{G}_t - (\lambda_t - \lambda_p),$$

where $\widehat{G}_t$ is an unbiased estimator of $\nabla_\lambda \mathbb{E}_q[\log p(y \mid z)]$ and $\lambda_p$ is the prior's natural parameter. The update becomes

$$\lambda_{t+1} = \lambda_t + \rho_t F(\lambda_t)^{-1} \widehat{\nabla}_\lambda\mathcal{L}(\lambda_t).$$

Smoothed SSVI, as proposed in (Mandt et al., 2014), replaces the instantaneous stochastic gradient with a moving average over the past $L$ minibatches:

$$\overline{S}_t = \frac{1}{L} \sum_{j=0}^{L-1} \widehat{S}_{t-j}, \qquad \overline{g}_t = (\eta - \lambda_t) + \overline{S}_t,$$

with $\eta$ the prior natural parameter. This yields a variance reduction of approximately $1/L$ but introduces bias as $\lambda_t$ drifts from its previous values. The mean squared error of the estimator decomposes into variance and squared-bias terms, giving the classical bias-variance tradeoff (Mandt et al., 2014).
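A minimal sketch of the smoothed update on an illustrative scalar model (prior natural parameter $\eta = 0$, toy data whose optimum sits at $\eta$ plus the expected statistic) with window $L = 10$:

```python
from collections import deque
import numpy as np

# Toy sketch of smoothed SSVI: the instantaneous minibatch statistic
# S_hat is replaced by a moving average S_bar over the last L minibatches,
# cutting the variance of the gradient direction by roughly 1/L at the
# cost of a small bias. Model, data, and sizes are all illustrative.

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)     # toy data, population mean ~2
eta, L = 0.0, 10                           # prior natural parameter, window
lam = 0.0                                  # variational natural parameter
window = deque(maxlen=L)                   # last L minibatch statistics
raw, smoothed = [], []

for t in range(300):
    S_hat = rng.choice(data, size=50).mean()   # noisy sufficient statistic
    window.append(S_hat)
    S_bar = sum(window) / len(window)          # moving-average estimate
    raw.append(S_hat)
    smoothed.append(S_bar)
    g_bar = (eta - lam) + S_bar                # smoothed gradient direction
    lam += (2.0 / (t + 2)) * g_bar             # diminishing step schedule

var_ratio = np.var(raw) / np.var(smoothed)     # approaches ~L in stationarity
```

The variance ratio between the raw and smoothed statistics empirically approaches $L$, matching the approximate $1/L$ reduction noted above.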

3. Non-Asymptotic Convergence Rates and Geometry

In conjugate exponential-family models, the mirror-descent update admits a non-asymptotic $\mathcal{O}(1/T)$ convergence rate with the step schedule $\rho_t = 2/(t+2)$, provided the ELBO is 1-smooth and 1-strongly convex relative to the Bregman divergence induced by the dual log-partition function $A^*$. The error after $T$ steps is bounded by

$$\mathbb{E}\big[\mathrm{KL}(\overline{q}_{T+1}\,\|\,q^*)\big] \leq \frac{V}{T+2},$$

where $V$ bounds the “mirror” variance and $\overline{q}_{T+1}$ is the weighted iterate average (Wu et al., 2024). This provides the first $\mathcal{O}(1/T)$ non-asymptotic rate for stochastic NGVI in the conjugate setting.
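The $\mathcal{O}(1/T)$ behaviour can be checked numerically on a toy conjugate recursion. Under the schedule $\rho_t = 2/(t+2)$ the iterate is a weighted average that down-weights early noise; in the illustrative scalar model below (all constants assumed), quadrupling $T$ cuts the mean squared error by roughly a factor of four:

```python
import numpy as np

# Toy numerical check of the O(1/T) rate: track a scalar natural
# parameter from noisy unbiased "minibatch" targets with the schedule
# rho_t = 2/(t+2). The recursion
#   lam_{t+1} = (1 - rho_t) * lam_t + rho_t * S_t
# weights later samples more heavily, and its mean squared error
# shrinks like 1/T. Model and noise scale are illustrative.

rng = np.random.default_rng(1)
lam_star, noise = 3.0, 1.0

def run(T):
    lam = 0.0
    for t in range(T):
        S_t = lam_star + noise * rng.normal()   # unbiased stochastic target
        rho = 2.0 / (t + 2)
        lam = (1 - rho) * lam + rho * S_t       # mirror / natural-gradient step
    return (lam - lam_star) ** 2

reps = 300
mse_100 = np.mean([run(100) for _ in range(reps)])
mse_400 = np.mean([run(400) for _ in range(reps)])
# mse_400 should be roughly 4x smaller than mse_100
```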

For nonconjugate likelihoods (e.g., logistic or Poisson regression), the ELBO in expectation-parameter space can be nonconvex, invalidating the mirror-descent analysis: $\ell(\omega) = -\mathcal{L}(\lambda(\omega))$ may lack global relative strong convexity, and negative eigenvalues arise in its Hessian, precluding standard $\mathcal{O}(1/T)$ guarantees absent additional global geometric conditions (Wu et al., 2024). A plausible implication is that for nonconjugate models only local convergence, or heuristic progress, is ensured.

4. Implementation in Exponential and Gaussian Families

Practical SSVI implementations avoid computing $F(\lambda)^{-1}$ explicitly:

  • Expectation Parameterization: For exponential families,

$$F(\lambda)^{-1} \nabla_\lambda\mathcal{L}(\lambda) = \nabla_\omega \mathcal{L}(\omega),$$

where $\omega$ denotes the expectation parameters. The update reduces to mirror descent in $\omega$-space; the result is mapped back to $\lambda$ via the dual gradient $\lambda_{t+1} = \nabla A^*(\omega_{t+1})$ (Wu et al., 2024).

  • Cholesky Factor for Gaussian Approximation: Parameterizing $q(\theta) = \mathcal{N}(\mu, \Sigma)$ via its Cholesky factor $C$ guarantees SPD updates:

$$\mu^{(t+1)} = \mu^{(t)} + \rho_t \Sigma^{(t)} \widehat{\nabla}_\theta h(\theta), \qquad C^{(t+1)} = C^{(t)} + \rho_t C^{(t)} H^{(t)},$$

with $H^{(t)} = C^{(t)\top} \overline{G}_C^{(t)}$, where $G_C$ is a stochastic estimate of the ELBO gradient with respect to $C$ (Tan, 2021). Positive definiteness and sparsity are preserved automatically.

  • Block and Sparse Structure: Natural-gradient updates decompose naturally in block-diagonal or sparsity-constrained Cholesky parametrizations, maintaining statistical efficiency and computational tractability (Tan, 2021).
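A noise-free sketch of the multiplicative Cholesky-factor recursion, with the expected gradients for a Gaussian target substituted for minibatch estimates and the triangular masking omitted (target precision, mean, and step size are all illustrative):

```python
import numpy as np

# Toy sketch of the multiplicative factor update for q(theta) = N(mu, Sigma),
# Sigma = C C^T. For a Gaussian target N(b, A^{-1}) the expected ELBO
# gradient gives H = I - C^T A C, and the update C <- C + rho * C * H keeps
# Sigma SPD whenever I + rho*H is invertible; its fixed point is
# Sigma = A^{-1}. In practice H is built from stochastic gradients and a
# triangular mask is applied; both are omitted here for clarity.

A = np.diag([1.0, 2.0, 4.0])        # illustrative target precision
b = np.array([1.0, -1.0, 0.5])      # illustrative target mean
mu = np.zeros(3)
C = np.eye(3)                        # factor of Sigma
rho = 0.1

for _ in range(400):
    Sigma = C @ C.T
    mu = mu + rho * Sigma @ (A @ (b - mu))   # preconditioned mean step
    H = np.eye(3) - C.T @ A @ C              # expected-gradient term
    C = C + rho * C @ H                      # SPD-preserving factor step
```

Because the factor is updated multiplicatively, $\Sigma = CC^\top$ remains symmetric positive definite at every iteration without any projection step.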

5. Smoothing, Momentum, and Practical Considerations

  • Gradient Smoothing: Moving-average smoothing of sufficient statistics or gradients is computationally light (an additional $\mathcal{O}(L \cdot \dim\lambda)$ memory), introduces a tunable bias-variance compromise, and empirically accelerates convergence by attenuating stochastic fluctuations (Mandt et al., 2014).
  • Step Size Schedules: In conjugate models, diminishing schedules ($\rho_t = 2/(t+2)$) are crucial for provable rates. For nonconjugate settings, small fixed or gently decaying step sizes are often effective in practice (Wu et al., 2024).
  • Momentum and Normalization: Stochastic normalized natural-gradient ascent with momentum (Snngm) outperforms Adam-style Euclidean methods in speed and quality of the variational approximation. Empirically, momentum $\beta = 0.9$–$0.99$ is robust, and the step size should scale with $\sqrt{\dim(\lambda)}$ (Tan, 2021).
  • Gradient Estimation: For Gaussians, Bonnet–Price gradient estimators guarantee that the covariance gradient remains negative definite for log-concave likelihoods, supporting domain-respecting updates (Wu et al., 2024).
  • Parameter Validity: Maintaining membership in the feasible domain (e.g., positive definite covariances) may require projection or constraint via SVD or Cholesky clamping, especially outside exponential family settings (Wu et al., 2024, Tan, 2021).
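A sketch of normalized natural-gradient ascent with momentum in the spirit of these recommendations (toy quadratic objective, assumed noise level and constants; this is not the exact Snngm algorithm):

```python
import numpy as np

# Toy sketch: accumulate a noisy natural-gradient direction in a momentum
# buffer and normalize the step by the buffer's norm, so the step length
# is controlled by rho alone. The objective here is an illustrative
# quadratic whose natural gradient is simply (target - lam) plus noise;
# the sqrt(dim) step scaling follows the recommendation above.

rng = np.random.default_rng(3)
dim = 16
target = rng.normal(size=dim)              # illustrative optimum
lam = np.zeros(dim)
m = np.zeros(dim)                          # momentum buffer
beta = 0.9
rho = 0.05 * np.sqrt(dim)                  # step scaled with sqrt(dim)

for t in range(2000):
    g_nat = (target - lam) + 0.1 * rng.normal(size=dim)  # noisy direction
    m = beta * m + (1 - beta) * g_nat                    # momentum update
    lam += rho * m / (np.linalg.norm(m) + 1e-12)         # normalized step
```

Normalization decouples the step length from the gradient magnitude, which is one reason such schemes tolerate larger nominal step sizes than plain stochastic ascent.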

6. Algorithmic Description

The generic Stochastic Natural-Gradient SSVI algorithm proceeds as follows (Wu et al., 2024, Mandt et al., 2014, Tan, 2021):

  • Initialize $\lambda_0$ (or the expectation parameter $\omega_0$).
  • For $t = 0, \dots, T-1$:
    • Update in natural-parameter space: $\lambda_{t+1} = \lambda_t + \rho_t F(\lambda_t)^{-1}\big[\widehat{G}_t - (\lambda_t - \lambda_p)\big]$; or
    • update in expectation-parameter space: $\omega_{t+1} = \omega_t - \rho_t \widehat{G}_t$, with $\widehat{G}_t$ here a stochastic gradient of $\ell(\omega) = -\mathcal{L}(\lambda(\omega))$, then map back via $\lambda_{t+1} = \nabla A^*(\omega_{t+1})$.
    • Enforce validity of the updated parameters (e.g., SPD covariances).
    • (Optional) Replace the stochastic gradient with its running average over a window of $L$ minibatches for smoothed SSVI (Mandt et al., 2014).
  • In Gaussian families, update $\mu$ and $C$ as above, directly ensuring SPD covariances and optionally incorporating momentum for further acceleration (Tan, 2021).
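For a conjugate model the loop above collapses to the classic SVI recursion $\lambda_{t+1} = (1-\rho_t)\lambda_t + \rho_t(\lambda_p + N\widehat{S}_t)$. A self-contained sketch for a Gaussian-mean model with known unit variance (data, sizes, and minibatch scheme are illustrative):

```python
import numpy as np

# Toy end-to-end run in a conjugate model: y_i ~ N(z, 1) with prior
# z ~ N(0, 1). Writing q(z) ∝ exp(lam[0]*z + lam[1]*z^2), the stochastic
# natural-gradient step is lam <- (1-rho_t)*lam + rho_t*(lam_p + N*S_hat),
# where S_hat holds the rescaled minibatch sufficient statistics.
# The exact posterior here is N(sum(y)/(N+1), 1/(N+1)).

rng = np.random.default_rng(4)
N = 2000
y = rng.normal(1.5, 1.0, size=N)            # illustrative data
lam_p = np.array([0.0, -0.5])               # prior natural parameters, N(0,1)
lam = lam_p.copy()

for t in range(500):
    batch = rng.choice(y, size=100)
    S_hat = np.array([batch.mean(), -0.5])  # per-datum statistics (averaged)
    rho = 2.0 / (t + 2)                     # provably convergent schedule
    lam = (1 - rho) * lam + rho * (lam_p + N * S_hat)
    assert lam[1] < 0                       # validity: q stays normalizable

post_mean = lam[0] / (-2.0 * lam[1])
post_var = 1.0 / (-2.0 * lam[1])
```

The iterate approaches the exact conjugate posterior, illustrating both the update rule and the parameter-validity check in one loop.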

7. Empirical Results and Theoretical Significance

Experiments on Latent Dirichlet Allocation, generalized linear mixed models, and deep neural networks indicate that:

  • SSVI with window sizes $L = 10$–$100$ substantially reduces variance and achieves faster growth in predictive likelihood than unsmoothed SVI (Mandt et al., 2014).
  • Snngm and momentum-based natural-gradient variants converge in orders of magnitude fewer epochs than Euclidean optimizers such as Adam, often producing superior ELBO values (Tan, 2021).
  • Block or sparse precision structures extend natural-gradient updates to highly structured posteriors at minimal computational overhead (Tan, 2021).
  • The proven $\mathcal{O}(1/T)$ rate in conjugate models establishes SSVI’s theoretical soundness; for nonconjugate models, global convergence remains unguaranteed without further geometric assumptions (Wu et al., 2024).

In summary, Stochastic Natural-Gradient SSVI unites the geometric adaptivity of natural-gradient optimization with the scalability of SVI. It achieves superior convergence rates and enables robust, scalable Bayesian inference under exponential family models, with implementation flexibility for smoothing, Cholesky parametrizations, block sparsity, and momentum (Wu et al., 2024, Mandt et al., 2014, Tan, 2021).
