Doubly Stochastic Variational Inference

Updated 3 February 2026
  • Doubly Stochastic Variational Inference is an optimization framework that uses mini-batch subsampling and Monte Carlo sampling to approximate intractable latent expectations.
  • It enables scalable Bayesian inference for nonconjugate and high-dimensional models, including deep generative models and Gaussian processes.
  • Variance reduction techniques such as Rao–Blackwellization and amortized control variates improve the efficiency and precision of stochastic gradient estimates in DSVI.

Doubly stochastic variational inference (DSVI) is an optimization framework for variational inference in probabilistic models with intractable per-data-point latent expectations and massive datasets. DSVI uses two layers of stochastic approximation: (1) mini-batch subsampling of the dataset to estimate expectations over observations, and (2) Monte Carlo sampling from variational or model distributions to approximate otherwise intractable expectations. This approach enables scalable approximate Bayesian inference for nonconjugate and high-dimensional latent variable models, including deep generative models and Bayesian nonparametrics, as well as privacy-preserving inference via gradient perturbation.

1. Concept and Theoretical Motivation

In standard variational inference, one introduces a family of tractable densities $q_\phi(z)$ to approximate an intractable posterior $p(z \mid y)$, optimizing the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z)}\bigl[\log p(y,z)\bigr] - \mathbb{E}_{q_\phi(z)}\bigl[\log q_\phi(z)\bigr] = \mathbb{E}_{q_\phi(z)}[f(y,z)]$$

where $f(y,z) = \log p(y,z) - \log q_\phi(z)$. When the dataset is large and the expectations over $q_\phi(z)$ are not analytically tractable (e.g., in nonconjugate models), direct gradient computation becomes infeasible. DSVI replaces both expectations with Monte Carlo estimators: (i) random mini-batch sampling of the data, and (ii) sampling from $q_\phi(z)$. The resulting stochastic gradient estimator for $\nabla_\phi \mathcal{L}$ has independent noise from both sources:

$$\nabla_\phi \mathcal{L} \approx \frac{1}{B}\sum_{b=1}^{B} \frac{1}{S}\sum_{s=1}^{S} f\bigl(y_b, z^{(s)}\bigr)\,\nabla_\phi \log q_\phi\bigl(z^{(s)}\bigr)$$

where $z^{(s)} \sim q_\phi(z)$ and the $y_b$ are mini-batch elements.

This doubly stochastic approach produces unbiased but noisy gradient estimates, making it suitable for stochastic optimization and compatible with modern automatic differentiation frameworks (Titsias, 2015).
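As a concrete sketch of the estimator above, consider a toy conjugate model (the model and all names here are hypothetical choices for illustration): $y_i \sim N(z, 1)$ with prior $z \sim N(0,1)$ and variational family $q_\phi(z) = N(\mu, \sigma^2)$. The mini-batch log-likelihood is rescaled by $N/B$ so the estimator targets the full-data ELBO.

```python
import math
import random

# Toy doubly stochastic score-function estimator (hypothetical model):
# y_i ~ N(z, 1), prior z ~ N(0, 1), variational q_phi(z) = N(mu, sigma^2).

def log_q(z, mu, sigma):
    """Log-density of the Gaussian variational distribution."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2.0 * sigma ** 2)

def dsvi_grad_mu(data, mu, sigma, batch_size=32, n_mc=1, rng=random):
    """Doubly stochastic estimate of the ELBO gradient w.r.t. mu."""
    N = len(data)
    batch = [data[rng.randrange(N)] for _ in range(batch_size)]   # stochasticity 1: mini-batch
    g = 0.0
    for _ in range(n_mc):                                         # stochasticity 2: MC over q
        z = rng.gauss(mu, sigma)
        # f(y, z) = log p(y, z) - log q(z); additive constants in log p are
        # dropped (they do not bias the estimator, since E[grad log q] = 0).
        log_lik = (N / batch_size) * sum(-0.5 * (y - z) ** 2 for y in batch)
        log_prior = -0.5 * z ** 2
        f = log_lik + log_prior - log_q(z, mu, sigma)
        g += f * (z - mu) / sigma ** 2                            # score of q w.r.t. mu
    return g / n_mc
```

Averaged over many draws, the estimator points toward the posterior mode: positive when $\mu$ sits below the data mean, negative when it sits above.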

2. Algorithmic Formulation and Gradient Estimation

The DSVI paradigm is instantiated in various model architectures by combining mini-batch data subsampling with MC estimation. For instance, in deep latent variable models (e.g., variational autoencoders, deep Gaussian processes), the objective typically takes the form

$$F(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta), \qquad f_i(\theta) = \mathbb{E}_{z \sim p_i(z)}[\ell_i(\theta; z)]$$

where each $f_i$ is an intractable expectation. At each iteration:

  • A random batch $B_t$ of size $b$ is sampled from $\{1,\dots,n\}$.
  • For each $i \in B_t$, $m$ MC samples $z_{i,1},\dots,z_{i,m} \sim p_i(\cdot)$ are drawn.
  • The gradient estimator is formed as

$$g_t = \frac{1}{b}\sum_{i \in B_t}\left[\frac{1}{m}\sum_{j=1}^{m} \nabla_\theta \ell_i(\theta_t; z_{i,j})\right]$$

and the parameter update is $\theta_{t+1} = \theta_t - \gamma g_t$ (Kim et al., 2024, Titsias, 2015).
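The iteration above can be sketched in a few lines of Python on a toy finite-sum objective (all names and the objective itself are hypothetical choices for this example): $\ell_i(\theta; z) = (\theta - x_i - z)^2$ with $z \sim N(0, s^2)$, so $F(\theta)$ is minimized at $\theta^* = \operatorname{mean}(x)$ since $\mathbb{E}[z] = 0$.

```python
import random

# Sketch of the DSVI SGD loop on a toy objective (hypothetical setup):
# l_i(theta; z) = (theta - x_i - z)^2 with z ~ N(0, s),
# so F(theta) = (1/n) sum_i E_z[l_i] is minimized at theta* = mean(x).

def dsvi_sgd(x, steps=2000, b=8, m=1, gamma=0.05, s=0.3, seed=0):
    rng = random.Random(seed)
    n, theta = len(x), 0.0
    for _ in range(steps):
        batch = [x[rng.randrange(n)] for _ in range(b)]   # sample batch B_t
        g = 0.0
        for xi in batch:
            for _ in range(m):                            # m MC samples per element
                z = rng.gauss(0.0, s)
                g += 2.0 * (theta - xi - z)               # grad_theta l_i(theta; z)
        g /= b * m                                        # estimator g_t
        theta -= gamma * g                                # update theta_{t+1} = theta_t - gamma * g_t
    return theta
```

With a constant step size the iterate fluctuates around the minimizer; a diminishing schedule (as discussed in Section 6) would make it converge.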

The key properties of this estimator are:

  • Unbiasedness: $\mathbb{E}[g_t] = \nabla F(\theta_t)$.
  • The variance decomposes into MC noise $\operatorname{Var}_z$ and mini-batch subsampling noise $\operatorname{Var}_B$ (Boustati et al., 2020, Kim et al., 2024).
  • Modern theory provides non-asymptotic convergence rates for DSVI under expected-residual (ER) and bounded-variance (BV) assumptions.
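The variance decomposition can be checked empirically via the law of total variance, $\operatorname{Var}[g] = \operatorname{Var}_B + \mathbb{E}[\operatorname{Var}_z]$, on a toy single-term estimator (hypothetical setup): $g = 2(\theta - x_i - z)$ with $i$ uniform over the data and $z \sim N(0, s^2)$.

```python
import random
import statistics

# Empirical check of the variance decomposition (toy, hypothetical setup):
# g = 2*(theta - x_i - z), i uniform over the dataset, z ~ N(0, s).
# Law of total variance: Var[g] = Var_B(E_z[g | i]) + E_B[Var_z(g | i)].

rng = random.Random(1)
theta, s = 0.0, 0.5
x = [rng.gauss(0.0, 1.0) for _ in range(50)]

samples = []
for _ in range(100_000):
    xi = x[rng.randrange(len(x))]        # mini-batch (subsampling) noise
    z = rng.gauss(0.0, s)                # MC noise
    samples.append(2.0 * (theta - xi - z))

total_var = statistics.pvariance(samples)
var_B = 4.0 * statistics.pvariance(x)    # Var_B of E_z[g | i] = 2*(theta - x_i)
var_z = 4.0 * s * s                      # Var_z[g | i], constant in i here
```

The empirical total variance should match the analytic sum of the two components closely.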

3. Variance Reduction Techniques

The convergence of DSVI critically depends on the variance of its stochastic gradients. Several mechanisms have been developed to reduce this variance:

  • Rao–Blackwellization: For conditionally factorized variational families, local expectation gradients (LeGrad) analytically integrate out a single latent variable conditional on all the others, yielding variance lower than or equal to that of naive MC ("score function") estimators, and in some cases lower than the reparameterization trick (Titsias, 2015). The LeGrad update leverages the law of iterated expectations:

$$\nabla_{\phi_i}\mathcal{L} = \mathbb{E}_{q_\phi(z_{\setminus i})}\left[\mathbb{E}_{q_{\phi_i}(z_i \mid \mathrm{mb}_i)}\bigl[f(y,z)\,\nabla_{\phi_i}\log q_{\phi_i}(z_i \mid \mathrm{pa}_i)\bigr]\right]$$

where $\mathrm{mb}_i$ and $\mathrm{pa}_i$ denote the Markov blanket and parents of $z_i$, respectively.

Empirically, this achieves up to an order of magnitude variance reduction.
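The variance advantage can be seen on a minimal two-variable example (all values and the target function are hypothetical): two independent Bernoulli latents with logit parameters $t_1, t_2$, target gradient $\nabla_{t_1}\mathbb{E}_q[f(z_1,z_2)]$. LeGrad sums $z_1$ out analytically and samples only $z_2$; the plain score-function estimator samples both.

```python
import math
import random
import statistics

# Toy comparison of LeGrad vs the plain score-function estimator
# (hypothetical example): q(z_k = 1) = sigmoid(t_k), independent, and
# f(z1, z2) = (2*z1 - 1) * (1 + z2).
# For Bernoulli(sigmoid(t)): d/dt log q(z) = z - sigmoid(t).

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f(z1, z2):
    return (2 * z1 - 1) * (1 + z2)

rng = random.Random(2)
t1, t2 = 0.5, -0.3
p1, p2 = sigmoid(t1), sigmoid(t2)

score, legrad = [], []
for _ in range(50_000):
    z1 = 1 if rng.random() < p1 else 0
    z2 = 1 if rng.random() < p2 else 0
    # Score function: sample both latents, weight f by the score of z1.
    score.append(f(z1, z2) * (z1 - p1))
    # LeGrad: sum z1 out analytically, conditioning on the sampled z2.
    legrad.append(p1 * f(1, z2) * (1 - p1) + (1 - p1) * f(0, z2) * (0 - p1))
```

Both estimators are unbiased for the same gradient, but Rao–Blackwellization guarantees the LeGrad estimator has variance no larger than the score-function one.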

  • Amortized Control Variates: Neural networks $r_\phi$ can be trained online to output data-dependent control variate weights $c_{bi}$, resulting in a variance-reduced gradient estimator per component. This can drive the normalized variance of the mini-batch gradient down to 0.2–0.3 of the original (Boustati et al., 2020).
  • Batch Size versus MC Simulation: For a fixed computational budget $b \times m$, variance analysis indicates that increasing the mini-batch size $b$ is more effective than increasing the number of MC samples $m$, especially when MC samples for different batch elements are correlated. Using $m = 1$ is frequently sufficient (Kim et al., 2024).
  • Random Reshuffling: Replacing independent mini-batch sampling with random reshuffling at each epoch reduces the impact of subsampling noise and yields improved asymptotic complexity (Kim et al., 2024).
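The batch-size-versus-MC-samples trade-off is easy to verify empirically under a fixed budget $b \times m = 16$, reusing the same toy estimator shape as before (hypothetical setup): each term is $2(\theta - x_i - z_{ij})$ with $i$ resampled per batch element and $z \sim N(0, s^2)$.

```python
import random
import statistics

# Fixed budget b*m = 16: compare (b=16, m=1) vs (b=1, m=16) on a toy
# estimator (hypothetical setup). Subsampling noise dominates here, so
# spending the budget on batch size should win.

def grad_estimate(x, theta, b, m, s, rng):
    g = 0.0
    for _ in range(b):
        xi = x[rng.randrange(len(x))]                     # batch element
        for _ in range(m):                                # MC samples for it
            g += 2.0 * (theta - xi - rng.gauss(0.0, s))
    return g / (b * m)

rng = random.Random(3)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
var_big_b = statistics.pvariance(
    [grad_estimate(x, 0.0, 16, 1, 0.3, rng) for _ in range(20_000)])
var_big_m = statistics.pvariance(
    [grad_estimate(x, 0.0, 1, 16, 0.3, rng) for _ in range(20_000)])
```

Because increasing $m$ only shrinks the MC component while increasing $b$ shrinks both components, the large-batch configuration has much lower variance.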

4. Applications in Gaussian Process Models

DSVI is foundational for scalable variational inference in both Gaussian Process Latent Variable Models (GPLVMs) and deep Gaussian processes:

  • Bayesian GPLVM: The DSVI procedure involves mini-batch subsampling to approximate the ELBO and MC sampling to compute the kernel statistics $\psi_{0n}$, $\psi_{1n}$, $\psi_{2n}$ for each latent $x_n$. The overall algorithm scales as $O((B+M)M^2 D)$ per step, where $B$ is the batch size and $M$ is the number of inducing points. Empirically, batch sizes $B \sim 100$–$500$, single-sample ($J=1$) MC draws, and $M \sim 0.1N$ inducing points yield strong performance on high-dimensional data with massive missingness, along with automatic relevance determination properties (Lalchand et al., 2022).
  • Deep Gaussian Processes: DSVI enables efficient fitting by drawing both mini-batch data and reparameterized latent paths through multiple GP layers. The computational cost is $O(BLM^2 + LM^3)$ per step, and the method enables inference on datasets with $N$ up to $10^9$ (Salimbeni et al., 2017). Deeper architectures (DGPs with 3–5 layers) consistently surpass shallow GP baselines across regression and classification benchmarks.

5. Differential Privacy via DSVI

Differentially private variational inference can be implemented by integrating gradient clipping and noise perturbation into the DSVI optimization loop. The per-example gradient contributions in each mini-batch are clipped to a maximum $\ell_2$ norm $c_t$, summed, and additive Gaussian noise with variance proportional to $c_t^2\sigma^2$ is added. Privacy amplification by subsampling further reduces the cumulative privacy cost per step. Empirically, this approach matches non-private accuracy to within a few percent on Bayesian logistic regression and achieves strong privacy guarantees without the inefficiencies of DP-SGLD or sampling-based methods (Jälkö et al., 2016).
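The clip-sum-perturb step described above can be sketched as follows (pure Python, function and variable names are hypothetical): each per-example gradient is clipped to $\ell_2$ norm $c_t$, the clipped gradients are summed, and per-coordinate Gaussian noise with standard deviation $c_t\sigma$ is added.

```python
import math
import random

# Sketch of one DP-DSVI gradient aggregation step (hypothetical names):
# clip each per-example gradient to l2 norm c_t, sum, add Gaussian noise.

def dp_sum_gradient(per_example_grads, c_t, sigma, rng):
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, c_t / norm) if norm > 0 else 1.0   # l2 clipping
        for k in range(dim):
            total[k] += scale * g[k]
    # Gaussian mechanism: noise scale proportional to the clipping bound.
    return [t + rng.gauss(0.0, c_t * sigma) for t in total]
```

Clipping bounds each example's contribution (its sensitivity), which is what calibrates the Gaussian noise to a privacy guarantee; gradients already inside the ball pass through unchanged.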

6. Practical Recommendations and Convergence Results

Theoretical analysis and empirical studies yield the following recommendations:

  • Maximize the batch size $b$ within hardware constraints; set the MC sample size $m=1$ (or as small as possible without incurring high MC variance).
  • Use random reshuffling over the dataset each epoch when feasible.
  • Select moderate gradient-clipping thresholds for DP-DSVI; a small $c_t$ induces bias, while a large $c_t$ increases noise.
  • In non-private DSVI, diminishing or adaptive step size schedules (e.g., Adam) ensure convergence to stationary points of the ELBO under standard stochastic approximation conditions.
  • Monitor both MC and mini-batch variance sources to optimize the gradient estimator trade-offs (Kim et al., 2024, Titsias, 2015, Lalchand et al., 2022).

DSVI unifies and generalizes several previous stochastic inference tools:

  • Score-function / REINFORCE: Handles arbitrary (discrete or continuous) $q_\phi$ but suffers from high variance; typically requires control variates.
  • Reparameterization trick: Lower variance for reparameterizable continuous models; not directly applicable to discrete or non-location-scale families.
  • LeGrad: As general as the score-function estimator, systematically achieves lower variance via exact local marginalization, and is trivially parallelizable. It requires no additional control variates and can be integrated as a drop-in replacement in autodiff toolkits (Titsias, 2015).

The table summarizes key distinctions among algorithms:

| Estimator | Model Family Support | Variance Control |
|---|---|---|
| Score-function (REINFORCE) | Discrete + continuous | High; needs control variates |
| Reparameterization | Reparameterizable continuous | Low; limited generality |
| Local Expectation Gradients (LeGrad) | Discrete + continuous | Systematically low |
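The variance contrast between the first two rows shows up even on a one-line example (hypothetical values): estimating $\tfrac{d}{d\mu}\mathbb{E}_{z \sim N(\mu,1)}[z^2]$, whose true value is $2\mu$, with both estimators.

```python
import random
import statistics

# Score-function vs reparameterization on d/dmu E_{z~N(mu,1)}[z^2] = 2*mu
# (toy example, hypothetical values).

rng = random.Random(4)
mu = 1.0
score, reparam = [], []
for _ in range(50_000):
    eps = rng.gauss(0.0, 1.0)
    z = mu + eps                     # reparameterization: z = mu + eps
    score.append(z * z * (z - mu))   # f(z) * d/dmu log N(z; mu, 1)
    reparam.append(2.0 * z)          # d/dmu f(mu + eps) = 2 * (mu + eps)
```

Both estimators agree in expectation, but the reparameterized one has far lower variance here, consistent with the table.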

Systematic variance reduction, compatibility with GPUs and large-scale autodiff environments, and applicability to nonconjugate, high-dimensional latent structures distinguish DSVI as the foundation for modern scalable Bayesian machine learning (Titsias, 2015, Lalchand et al., 2022, Salimbeni et al., 2017, Boustati et al., 2020, Kim et al., 2024, Jälkö et al., 2016).
