
Stochastic Natural-Gradient SSVI

Updated 22 February 2026
  • The paper introduces a scalable Bayesian inference technique that merges natural gradient optimization with stochastic variational inference, achieving provable convergence in conjugate models.
  • It employs gradient smoothing and adaptive step sizes to reduce variance and enhance convergence rates in large-scale, nonconjugate settings.
  • Practical implementations leverage Cholesky parameterizations and block sparsity to ensure stability, efficiency, and adherence to statistical constraints.

Stochastic Natural-Gradient SSVI is a scalable Bayesian inference methodology that integrates natural-gradient optimization with stochastic variational inference (SVI), offering provable convergence guarantees in conjugate models and substantial empirical speed-ups in nonconjugate or large-scale settings. The core contributions and theoretical foundations are delineated below.

1. ELBO Optimization and Natural Gradient Foundations

The objective in variational inference is to maximize the evidence lower bound (ELBO):

$$\mathcal{L}(\lambda) = \mathbb{E}_{q(z;\lambda)}[\log p(y, z)] - \mathbb{E}_{q(z;\lambda)}[\log q(z;\lambda)],$$

where $q(z;\lambda)$ is a variational family over latent variables $z$, parameterized by $\lambda$. In exponential-family models, $q(\cdot;\lambda)$ defines a Riemannian geometry via its Fisher information matrix,

$$F(\lambda) = \mathbb{E}_{q(z;\lambda)}\big[\nabla_\lambda \log q(z;\lambda)\,\nabla_\lambda \log q(z;\lambda)^\top\big] = -\mathbb{E}_q\big[\nabla^2_{\lambda} \log q\big].$$

The natural gradient preconditions the standard gradient by $F(\lambda)^{-1}$, yielding steepest ascent in the Fisher information geometry:

$$\widetilde{\nabla}_\lambda\mathcal{L}(\lambda) = F(\lambda)^{-1} \nabla_\lambda \mathcal{L}(\lambda).$$

This direction adapts dynamically to the local curvature of the statistical model, improving convergence rates and stability, especially in high-dimensional settings (Wu et al., 2024, Mandt et al., 2014, Tan, 2021).
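The preconditioning can be made concrete with a one-parameter toy example. The sketch below (illustrative model, target, and step size; not drawn from the cited papers) applies natural-gradient ascent to a single Bernoulli factor, whose Fisher information is the scalar $\mu(1-\mu)$:

```python
import numpy as np

# Toy sketch: natural gradient for a Bernoulli variational factor
# q(z; lam) with natural parameter lam (the logit). The log-partition is
# A(lam) = log(1 + exp(lam)), so the Fisher information is the scalar
# F(lam) = A''(lam) = mu * (1 - mu), with mu = sigmoid(lam).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def natural_gradient(grad_lam, lam):
    """Precondition a Euclidean gradient in lam by F(lam)^{-1}."""
    mu = sigmoid(lam)
    fisher = mu * (1.0 - mu)            # Fisher information (scalar here)
    return grad_lam / fisher

# Illustrative objective: maximize L(lam) = mu* lam - A(lam) for a target
# mean mu* = 0.8; its Euclidean gradient in lam is mu* - sigmoid(lam).
target = 0.8
lam = 0.0
for _ in range(50):
    grad = target - sigmoid(lam)                # Euclidean gradient
    lam += 0.5 * natural_gradient(grad, lam)    # natural-gradient ascent
# sigmoid(lam) is now very close to the target mean
```

Dividing by the Fisher information rescales the logit gradient so that each step makes near-uniform progress in the mean parameter, which is exactly the curvature adaptation described above.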

2. Stochastic Natural-Gradient Update and Smoothing

For large datasets, or when expectations in the ELBO are intractable, stochastic estimation is used:

$$\widehat{\nabla}_\lambda\mathcal{L}(\lambda_t) = \widehat{G}_t - (\lambda_t - \lambda_p),$$

where $\widehat{G}_t$ is an unbiased estimator of $\nabla_\lambda \mathbb{E}_q[\log p(y \mid z)]$ and $\lambda_p$ is the prior's natural parameter. The update becomes

$$\lambda_{t+1} = \lambda_t + \rho_t F(\lambda_t)^{-1} \widehat{\nabla}_\lambda\mathcal{L}(\lambda_t).$$

Smoothed SSVI, as proposed in (Mandt et al., 2014), replaces the instantaneous stochastic gradient with a moving average over the past $L$ minibatches:

$$\overline{S}_t = \frac{1}{L} \sum_{j=0}^{L-1} \widehat{S}_{t-j}, \qquad \overline{g}_t = (\eta - \lambda_t) + \overline{S}_t,$$

with $\eta$ the prior natural parameter. This yields a variance reduction of approximately $1/L$ but introduces bias as $\lambda_t$ drifts from its previous values. The mean squared error of the estimator decomposes into variance and squared-bias terms, giving the classical bias-variance tradeoff (Mandt et al., 2014).
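A minimal sketch of the smoothed update on an illustrative scalar model (prior natural parameter $\eta = 0$, toy data whose optimum sits at $\eta$ plus the expected statistic) with window $L = 10$:

```python
from collections import deque
import numpy as np

# Toy sketch of smoothed SSVI: the instantaneous minibatch statistic
# S_hat is replaced by a moving average S_bar over the last L minibatches,
# cutting the variance of the gradient direction by roughly 1/L at the
# cost of a small bias. Model, data, and sizes are all illustrative.

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)     # toy data, population mean ~2
eta, L = 0.0, 10                           # prior natural parameter, window
lam = 0.0                                  # variational natural parameter
window = deque(maxlen=L)                   # last L minibatch statistics
raw, smoothed = [], []

for t in range(300):
    S_hat = rng.choice(data, size=50).mean()   # noisy sufficient statistic
    window.append(S_hat)
    S_bar = sum(window) / len(window)          # moving-average estimate
    raw.append(S_hat)
    smoothed.append(S_bar)
    g_bar = (eta - lam) + S_bar                # smoothed gradient direction
    lam += (2.0 / (t + 2)) * g_bar             # diminishing step schedule

var_ratio = np.var(raw) / np.var(smoothed)     # approaches ~L in stationarity
```

The variance ratio between the raw and smoothed statistics empirically approaches $L$, matching the approximate $1/L$ reduction noted above.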

3. Non-Asymptotic Convergence Rates and Geometry

In conjugate exponential-family models, the mirror-descent update admits a non-asymptotic $\mathcal{O}(1/T)$ convergence rate with the step schedule $\rho_t = 2/(t+2)$, provided the ELBO is 1-smooth and 1-strongly convex relative to the Bregman divergence induced by the dual log-partition function $A^*$. The error after $T$ steps is bounded by

$$\mathbb{E}\big[\mathrm{KL}(\overline{q}_{T+1}\,\|\,q^*)\big] \leq \frac{V}{T+2},$$

where $V$ bounds the “mirror” variance and $\overline{q}_{T+1}$ is the weighted iterate average (Wu et al., 2024). This provides the first $\mathcal{O}(1/T)$ non-asymptotic rate for stochastic NGVI in the conjugate setting.
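The $\mathcal{O}(1/T)$ behaviour can be checked numerically on a toy conjugate recursion. Under the schedule $\rho_t = 2/(t+2)$ the iterate is a weighted average that down-weights early noise; in the illustrative scalar model below (all constants assumed), quadrupling $T$ cuts the mean squared error by roughly a factor of four:

```python
import numpy as np

# Toy numerical check of the O(1/T) rate: track a scalar natural
# parameter from noisy unbiased "minibatch" targets with the schedule
# rho_t = 2/(t+2). The recursion
#   lam_{t+1} = (1 - rho_t) * lam_t + rho_t * S_t
# weights later samples more heavily, and its mean squared error
# shrinks like 1/T. Model and noise scale are illustrative.

rng = np.random.default_rng(1)
lam_star, noise = 3.0, 1.0

def run(T):
    lam = 0.0
    for t in range(T):
        S_t = lam_star + noise * rng.normal()   # unbiased stochastic target
        rho = 2.0 / (t + 2)
        lam = (1 - rho) * lam + rho * S_t       # mirror / natural-gradient step
    return (lam - lam_star) ** 2

reps = 300
mse_100 = np.mean([run(100) for _ in range(reps)])
mse_400 = np.mean([run(400) for _ in range(reps)])
# mse_400 should be roughly 4x smaller than mse_100
```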

For nonconjugate likelihoods (e.g., logistic or Poisson regression), the ELBO in expectation-parameter space can be nonconvex, invalidating the mirror-descent analysis: $\ell(\omega) = -\mathcal{L}(\lambda(\omega))$ may lack global relative strong convexity, and negative eigenvalues arise in its Hessian, precluding standard $\mathcal{O}(1/T)$ guarantees absent additional global geometric conditions (Wu et al., 2024). A plausible implication is that for nonconjugate models only local convergence, or heuristic progress, is ensured.

4. Implementation in Exponential and Gaussian Families

Practical SSVI implementations avoid computing $F(\lambda)^{-1}$ explicitly:

  • Expectation Parameterization: For exponential families,

$$F(\lambda)^{-1} \nabla_\lambda\mathcal{L}(\lambda) = \nabla_\omega \mathcal{L}(\omega),$$

where $\omega$ denotes the expectation parameters. The update reduces to mirror descent in $\omega$-space; the result is mapped back to $\lambda$ via the dual gradient $\lambda_{t+1} = \nabla A^*(\omega_{t+1})$ (Wu et al., 2024).

  • Cholesky Factor for Gaussian Approximation: Parameterizing $q(\theta) = \mathcal{N}(\mu, \Sigma)$ via its Cholesky factor $C$ guarantees SPD updates:

$$\mu^{(t+1)} = \mu^{(t)} + \rho_t \Sigma^{(t)} \widehat{\nabla}_\theta h(\theta), \qquad C^{(t+1)} = C^{(t)} + \rho_t C^{(t)} H^{(t)},$$

with $H^{(t)} = C^{(t)\top} \overline{G}_C^{(t)}$, where $G_C$ is a stochastic estimate of the ELBO gradient with respect to $C$ (Tan, 2021). Positive definiteness and sparsity are preserved automatically.

  • Block and Sparse Structure: Natural-gradient updates decompose naturally in block-diagonal or sparsity-constrained Cholesky parametrizations, maintaining statistical efficiency and computational tractability (Tan, 2021).
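A noise-free sketch of the multiplicative Cholesky-factor recursion, with the expected gradients for a Gaussian target substituted for minibatch estimates and the triangular masking omitted (target precision, mean, and step size are all illustrative):

```python
import numpy as np

# Toy sketch of the multiplicative factor update for q(theta) = N(mu, Sigma),
# Sigma = C C^T. For a Gaussian target N(b, A^{-1}) the expected ELBO
# gradient gives H = I - C^T A C, and the update C <- C + rho * C * H keeps
# Sigma SPD whenever I + rho*H is invertible; its fixed point is
# Sigma = A^{-1}. In practice H is built from stochastic gradients and a
# triangular mask is applied; both are omitted here for clarity.

A = np.diag([1.0, 2.0, 4.0])        # illustrative target precision
b = np.array([1.0, -1.0, 0.5])      # illustrative target mean
mu = np.zeros(3)
C = np.eye(3)                        # factor of Sigma
rho = 0.1

for _ in range(400):
    Sigma = C @ C.T
    mu = mu + rho * Sigma @ (A @ (b - mu))   # preconditioned mean step
    H = np.eye(3) - C.T @ A @ C              # expected-gradient term
    C = C + rho * C @ H                      # SPD-preserving factor step
```

Because the factor is updated multiplicatively, $\Sigma = CC^\top$ remains symmetric positive definite at every iteration without any projection step.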

5. Smoothing, Momentum, and Practical Considerations

  • Gradient Smoothing: Moving-average smoothing of sufficient statistics or gradients is computationally light (an additional $\mathcal{O}(L \cdot \dim\lambda)$ memory), introduces a tunable bias-variance compromise, and empirically accelerates convergence by attenuating stochastic fluctuations (Mandt et al., 2014).
  • Step Size Schedules: In conjugate models, diminishing schedules ($\rho_t = 2/(t+2)$) are crucial for provable rates. For nonconjugate settings, small fixed or gently decaying step sizes are often effective in practice (Wu et al., 2024).
  • Momentum and Normalization: Stochastic normalized natural-gradient ascent with momentum (Snngm) outperforms Adam-style Euclidean methods in speed and quality of the variational approximation. Empirically, momentum $\beta = 0.9$–$0.99$ is robust, and the step size should scale with $\sqrt{\dim(\lambda)}$ (Tan, 2021).
  • Gradient Estimation: For Gaussians, Bonnet–Price gradient estimators guarantee that the covariance gradient remains negative definite for log-concave likelihoods, supporting domain-respecting updates (Wu et al., 2024).
  • Parameter Validity: Maintaining membership in the feasible domain (e.g., positive definite covariances) may require projection or constraint via SVD or Cholesky clamping, especially outside exponential family settings (Wu et al., 2024, Tan, 2021).
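A sketch of normalized natural-gradient ascent with momentum in the spirit of these recommendations (toy quadratic objective, assumed noise level and constants; this is not the exact Snngm algorithm):

```python
import numpy as np

# Toy sketch: accumulate a noisy natural-gradient direction in a momentum
# buffer and normalize the step by the buffer's norm, so the step length
# is controlled by rho alone. The objective here is an illustrative
# quadratic whose natural gradient is simply (target - lam) plus noise;
# the sqrt(dim) step scaling follows the recommendation above.

rng = np.random.default_rng(3)
dim = 16
target = rng.normal(size=dim)              # illustrative optimum
lam = np.zeros(dim)
m = np.zeros(dim)                          # momentum buffer
beta = 0.9
rho = 0.05 * np.sqrt(dim)                  # step scaled with sqrt(dim)

for t in range(2000):
    g_nat = (target - lam) + 0.1 * rng.normal(size=dim)  # noisy direction
    m = beta * m + (1 - beta) * g_nat                    # momentum update
    lam += rho * m / (np.linalg.norm(m) + 1e-12)         # normalized step
```

Normalization decouples the step length from the gradient magnitude, which is one reason such schemes tolerate larger nominal step sizes than plain stochastic ascent.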

6. Algorithmic Description

The generic Stochastic Natural-Gradient SSVI algorithm proceeds as follows (Wu et al., 2024, Mandt et al., 2014, Tan, 2021):

  • Initialize $\lambda_0$ (or the expectation parameter $\omega_0$).
  • For $t = 0, \dots, T-1$:
    • Update in natural-parameter space: $\lambda_{t+1} = \lambda_t + \rho_t F(\lambda_t)^{-1}\big[\widehat{G}_t - (\lambda_t - \lambda_p)\big]$; or
    • update in expectation-parameter space: $\omega_{t+1} = \omega_t - \rho_t \widehat{G}_t$, with $\widehat{G}_t$ here a stochastic gradient of $\ell(\omega) = -\mathcal{L}(\lambda(\omega))$, then map back via $\lambda_{t+1} = \nabla A^*(\omega_{t+1})$.
    • Enforce validity of the updated parameters (e.g., SPD covariances).
    • (Optional) Replace the stochastic gradient with its running average over a window of $L$ minibatches for smoothed SSVI (Mandt et al., 2014).
  • In Gaussian families, update $\mu$ and $C$ as above, directly ensuring SPD covariances and optionally incorporating momentum for further acceleration (Tan, 2021).
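For a conjugate model the loop above collapses to the classic SVI recursion $\lambda_{t+1} = (1-\rho_t)\lambda_t + \rho_t(\lambda_p + N\widehat{S}_t)$. A self-contained sketch for a Gaussian-mean model with known unit variance (data, sizes, and minibatch scheme are illustrative):

```python
import numpy as np

# Toy end-to-end run in a conjugate model: y_i ~ N(z, 1) with prior
# z ~ N(0, 1). Writing q(z) ∝ exp(lam[0]*z + lam[1]*z^2), the stochastic
# natural-gradient step is lam <- (1-rho_t)*lam + rho_t*(lam_p + N*S_hat),
# where S_hat holds the rescaled minibatch sufficient statistics.
# The exact posterior here is N(sum(y)/(N+1), 1/(N+1)).

rng = np.random.default_rng(4)
N = 2000
y = rng.normal(1.5, 1.0, size=N)            # illustrative data
lam_p = np.array([0.0, -0.5])               # prior natural parameters, N(0,1)
lam = lam_p.copy()

for t in range(500):
    batch = rng.choice(y, size=100)
    S_hat = np.array([batch.mean(), -0.5])  # per-datum statistics (averaged)
    rho = 2.0 / (t + 2)                     # provably convergent schedule
    lam = (1 - rho) * lam + rho * (lam_p + N * S_hat)
    assert lam[1] < 0                       # validity: q stays normalizable

post_mean = lam[0] / (-2.0 * lam[1])
post_var = 1.0 / (-2.0 * lam[1])
```

The iterate approaches the exact conjugate posterior, illustrating both the update rule and the parameter-validity check in one loop.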

7. Empirical Results and Theoretical Significance

Experiments on Latent Dirichlet Allocation, generalized linear mixed models, and deep neural networks indicate that:

  • SSVI with window sizes $L = 10$–$100$ substantially reduces variance and achieves faster growth in predictive likelihood than unsmoothed SVI (Mandt et al., 2014).
  • Snngm and momentum-based natural-gradient variants converge in orders of magnitude fewer epochs than Euclidean optimizers such as Adam, often producing superior ELBO values (Tan, 2021).
  • Block or sparse precision structures extend natural-gradient updates to highly structured posteriors at minimal computational overhead (Tan, 2021).
  • The proven $\mathcal{O}(1/T)$ rate in conjugate models establishes SSVI’s theoretical soundness; for nonconjugate models, global convergence remains unguaranteed without further geometric assumptions (Wu et al., 2024).

In summary, Stochastic Natural-Gradient SSVI unites the geometric adaptivity of natural-gradient optimization with the scalability of SVI. It achieves superior convergence rates and enables robust, scalable Bayesian inference under exponential family models, with implementation flexibility for smoothing, Cholesky parametrizations, block sparsity, and momentum (Wu et al., 2024, Mandt et al., 2014, Tan, 2021).
