Doubly Stochastic Variational Inference
- Doubly Stochastic Variational Inference is an optimization framework that uses mini-batch subsampling and Monte Carlo sampling to approximate intractable latent expectations.
- It enables scalable Bayesian inference for nonconjugate and high-dimensional models, including deep generative models and Gaussian processes.
- Variance reduction techniques such as Rao–Blackwellization and amortized control variates improve the efficiency and precision of stochastic gradient estimates in DSVI.
Doubly stochastic variational inference (DSVI) is an optimization framework for variational inference in probabilistic models with intractable per-data-point latent expectations and massive datasets. DSVI uses two layers of stochastic approximation: (1) mini-batch subsampling of the dataset to estimate expectations over observations, and (2) Monte Carlo sampling from variational or model distributions to approximate otherwise intractable expectations. This approach enables scalable approximate Bayesian inference for nonconjugate and high-dimensional latent variable models, including deep generative models and Bayesian nonparametrics, and supports extensions such as privacy-preserving inference via gradient perturbation.
1. Concept and Theoretical Motivation
In standard variational inference, one introduces a family of tractable densities $q_\phi(z)$ to approximate an intractable posterior $p(z \mid x)$, optimizing the Evidence Lower Bound (ELBO): $\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z)}\big[\log p(x, z) - \log q_\phi(z)\big]$, where $\log p(x) \geq \mathcal{L}(\phi)$. When the dataset is large and the expectations over $q_\phi$ are not analytically tractable (e.g., in nonconjugate models), direct gradient computation becomes infeasible. DSVI replaces both expectations with Monte Carlo estimators: (i) random mini-batch sampling of the data, and (ii) sampling from $q_\phi$. The resulting stochastic gradient estimator for $\nabla_\phi \mathcal{L}$ carries independent noise from both sources: $\hat{g} = \frac{N}{M} \sum_{i \in B} \frac{1}{S} \sum_{s=1}^{S} \nabla_\phi f(x_i, z_i^{(s)}; \phi)$, where $B$ is a mini-batch of size $M$, $z_i^{(s)} \sim q_\phi$, and $f$ denotes the per-data-point ELBO integrand.
This doubly stochastic approach produces unbiased but noisy gradient estimates—suitable for stochastic optimization and compatible with modern automatic differentiation frameworks (Titsias, 2015).
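The two noise sources can be made concrete with a minimal sketch. The toy model below (a single Gaussian latent with likelihood $x_i \sim \mathcal{N}(z, 1)$, prior $z \sim \mathcal{N}(0, 1)$, and $q(z) = \mathcal{N}(\mu, s^2)$ with $s$ held fixed) is an illustrative assumption, not a model from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: x_i ~ N(z, 1), prior z ~ N(0, 1), q(z) = N(mu, s^2).
N = 10_000
x = rng.normal(2.0, 1.0, size=N)
s = 0.5

def dsvi_grad_mu(mu, batch_size=64, n_mc=1):
    """Doubly stochastic estimate of d(ELBO)/d(mu) via reparameterization."""
    batch = rng.choice(x, size=batch_size, replace=False)  # noise source 1: data
    z = mu + s * rng.standard_normal(n_mc)                 # noise source 2: MC
    # d/dz of the subsampled log-joint: (N/M) * sum_i (x_i - z) - z,
    # then chain rule dz/dmu = 1; the entropy of q does not depend on mu.
    dlogp_dz = (N / batch_size) * (batch[None, :] - z[:, None]).sum(axis=1) - z
    return dlogp_dz.mean()

mu = 0.0
for t in range(2000):
    mu += 1e-4 / (1.0 + 0.01 * t) * dsvi_grad_mu(mu)
# mu approaches the exact posterior mean sum(x) / (N + 1), close to 2 here
```

Despite single-sample MC draws and a small mini-batch, the decaying step size averages out both noise sources.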
2. Algorithmic Formulation and Gradient Estimation
The DSVI paradigm is instantiated in various model architectures by combining mini-batch data subsampling with MC estimation. For instance, in deep latent variable models (e.g., variational autoencoders, deep Gaussian processes), the objective typically takes the form $\mathcal{L}(\phi) = \sum_{i=1}^{N} \big( \mathbb{E}_{q_\phi(z_i)}[\log p(x_i \mid z_i)] - \mathrm{KL}(q_\phi(z_i) \,\|\, p(z_i)) \big)$, where $\mathbb{E}_{q_\phi(z_i)}[\log p(x_i \mid z_i)]$ is an intractable expectation. At each iteration:
- A random batch $B$ of size $M$ is sampled from the $N$ data points.
- For each $x_i \in B$, $S$ MC samples $z_i^{(s)} \sim q_\phi(z_i)$ are drawn.
- The gradient estimator is formed as $\hat{g}_t = \frac{N}{M} \sum_{i \in B} \Big( \frac{1}{S} \sum_{s=1}^{S} \nabla_\phi \log p(x_i \mid z_i^{(s)}) - \nabla_\phi \mathrm{KL}\big(q_\phi(z_i) \,\|\, p(z_i)\big) \Big)$,
and the parameter update is $\phi_{t+1} = \phi_t + \rho_t \hat{g}_t$ with step sizes $\rho_t$ (Kim et al., 2024, Titsias, 2015).
The key properties of this estimator are:
- Unbiasedness: $\mathbb{E}[\hat{g}_t] = \nabla_\phi \mathcal{L}(\phi_t)$.
- Variance decomposes into MC noise and mini-batch subsampling noise (Boustati et al., 2020, Kim et al., 2024).
- Modern theory provides non-asymptotic convergence rates for DSVI under ER (expected residual) and BV (bounded variance) assumptions.
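The unbiasedness property can be checked empirically on a conjugate toy model where the exact ELBO gradient has a closed form; the model and all numbers below are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: x_i ~ N(z, 1), prior z ~ N(0, 1), q(z) = N(mu, s^2).
N, M, s, mu = 1000, 50, 0.5, 1.0
x = rng.normal(2.0, 1.0, size=N)

# Exact gradient of the ELBO w.r.t. mu for this conjugate toy:
exact = x.sum() - (N + 1) * mu

def g_hat():
    """One doubly stochastic draw: mini-batch + single reparameterized sample."""
    batch = rng.choice(x, size=M, replace=False)
    z = mu + s * rng.standard_normal()
    return (N / M) * (batch - z).sum() - z

draws = np.array([g_hat() for _ in range(50_000)])
# The empirical mean of the noisy estimator approaches the exact gradient,
# illustrating E[g_hat] = grad ELBO despite both noise sources.
```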
3. Variance Reduction Techniques
The convergence of DSVI critically depends on the variance of its stochastic gradients. Several mechanisms have been developed to reduce this variance:
- Rao–Blackwellization: For conditionally factorized variational families, local expectation gradients (LeGrad) analytically integrate out a single latent variable conditional on all others, yielding strictly lower or equal variance compared to naive MC ("score function") estimators and even the reparameterization trick in some cases (Titsias, 2015). The LeGrad update leverages the law of iterated expectations: $\nabla_{\phi_i} \mathcal{L} = \mathbb{E}_{q(z_{\setminus i})}\big[\, \mathbb{E}_{q(z_i)}\big[ f(z)\, \nabla_{\phi_i} \log q_{\phi_i}(z_i) \big] \big]$, with the inner expectation over $z_i$ computed exactly (by summation for discrete variables or quadrature for continuous ones).
Empirically, this achieves up to an order of magnitude variance reduction.
- Amortized Control Variates: Neural networks can be trained online to output data-dependent control variate weights, yielding a variance-reduced gradient estimator per component. This can drive the normalized variance of the mini-batch gradient down to $0.2$–$0.3$ of the original (Boustati et al., 2020).
- Batch Size versus MC Simulation: For a fixed computational budget $C = M S$ (batch size $M$ times MC samples $S$ per datum), variance analysis indicates that increasing the mini-batch size $M$ is more effective than increasing the number of MC samples $S$, especially when MC samples for different batch elements are correlated. Using $S = 1$ is frequently sufficient (Kim et al., 2024).
- Random Reshuffling: Replacing independent mini-batch sampling with random reshuffling at each epoch reduces the impact of subsampling noise and yields improved asymptotic complexity (Kim et al., 2024).
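A small experiment illustrates the Rao–Blackwellization bullet above: for two independent Bernoulli latents (an assumed toy objective, not from the cited papers), sampling one latent and marginalizing the other exactly gives an unbiased gradient estimate with far lower variance than the naive score-function estimator:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy setup: z1 ~ Bern(t1), z2 ~ Bern(t2), target L = E_q[f(z1, z2)].
t1, t2 = 0.3, 0.7
f = lambda z1, z2: 3.0 * z1 + z1 * z2 + 0.5 * z2   # arbitrary test function

def score_grad_t1():
    """Naive score-function estimate of dL/dt1 from one joint sample."""
    z1 = rng.random() < t1
    z2 = rng.random() < t2
    dlogq = (1.0 if z1 else 0.0) / t1 - (0.0 if z1 else 1.0) / (1 - t1)
    return f(z1, z2) * dlogq

def legrad_t1():
    """LeGrad-style estimate: sample z2, marginalize z1 exactly."""
    z2 = rng.random() < t2
    return f(1, z2) - f(0, z2)   # d/dt1 [t1 * f(1, z2) + (1 - t1) * f(0, z2)]

S = 100_000
a = np.array([score_grad_t1() for _ in range(S)])
b = np.array([legrad_t1() for _ in range(S)])
# Both estimators are unbiased for dL/dt1 = 3 + t2 = 3.7,
# but the Rao-Blackwellized one has much smaller variance.
```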
4. Applications in Gaussian Process Models
DSVI is foundational for scalable variational inference in both Gaussian Process Latent Variable Models (GPLVMs) and deep Gaussian processes:
- Bayesian GPLVM: The DSVI procedure involves mini-batch subsampling to approximate the ELBO and MC sampling to compute the kernel statistics $\psi_0$, $\Psi_1$, $\Psi_2$ for each latent $z_i$. The overall algorithm scales as $\mathcal{O}(M m^2)$ per step, where $M$ is the batch size and $m$ is the number of inducing points. Empirically, batch sizes up to $500$, single-sample MC draws ($S = 1$), and a moderate number of inducing points yield strong performance on high-dimensional data with massive missingness, while retaining automatic relevance determination properties (Lalchand et al., 2022).
- Deep Gaussian Processes: DSVI enables efficient fitting by drawing both mini-batch data and reparameterized latent paths through multiple GP layers. The computational cost is $\mathcal{O}(M m^2 L)$ per step for $L$ layers, and the method enables inference on datasets with up to a billion data points (Salimbeni et al., 2017). Deeper architectures (DGPs with 3–5 layers) consistently surpass shallow GP baselines across regression and classification benchmarks.
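The MC kernel-statistic idea from the GPLVM bullet can be sketched for an RBF kernel in one latent dimension, where $\Psi_1 = \mathbb{E}_{q(z)}[k(z, u_j)]$ has a closed form to compare against; the kernel, lengthscale, and inducing inputs below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(z, u, ls=1.0):
    """RBF kernel matrix between latent samples z and inducing inputs u."""
    return np.exp(-0.5 * ((z[:, None] - u[None, :]) / ls) ** 2)

u = np.linspace(-2.0, 2.0, 5)      # assumed inducing inputs
mu, s2, ls = 0.3, 0.25, 1.0        # q(z) = N(mu, s2), assumed lengthscale

# Closed form for Psi_1[j] = E_{q(z)}[k(z, u_j)] under a Gaussian q(z):
psi1_exact = np.sqrt(ls**2 / (ls**2 + s2)) * np.exp(
    -0.5 * (mu - u) ** 2 / (ls**2 + s2)
)

# The single-sample MC estimator used by DSVI, averaged here over many draws
# to show it targets the exact statistic:
z = mu + np.sqrt(s2) * rng.standard_normal(100_000)
psi1_mc = rbf(z, u).mean(axis=0)
```

In higher latent dimensions no such closed form is needed: the MC estimate is all DSVI requires.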
5. Differential Privacy via DSVI
Differentially private variational inference can be implemented by integrating gradient clipping and noise perturbation into the DSVI optimization loop. The per-example gradient contributions in each mini-batch are clipped to a maximum norm $C$, summed, and additive Gaussian noise with variance proportional to $C^2$ is added. Privacy amplification by subsampling further reduces cumulative privacy costs per step. Empirically, this approach matches non-private accuracy to within a few percent on Bayesian logistic regression and achieves strong privacy guarantees without the inefficiencies of DP-SGLD or sampling-based methods (Jälkö et al., 2016).
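A minimal sketch of the clip-and-perturb step, assuming illustrative values for the clipping norm and noise multiplier (not calibrated to any specific $(\varepsilon, \delta)$ budget):

```python
import numpy as np

rng = np.random.default_rng(4)

def private_batch_grad(per_example_grads, clip_norm=1.0, noise_mult=1.1):
    """Clip each per-example gradient to L2 norm <= clip_norm, sum, and add
    Gaussian noise scaled to the clipping bound (Gaussian-mechanism sketch;
    clip_norm and noise_mult are illustrative hyperparameters)."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return total + noise

grads = rng.normal(size=(64, 10))          # mock per-example ELBO gradients
g_priv = private_batch_grad(grads)         # noisy gradient for the DSVI update
```

Clipping bounds each example's influence (the sensitivity of the sum), so the added noise level depends only on `clip_norm`, not on the data.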
6. Practical Recommendations and Convergence Results
Theoretical analysis and empirical studies yield the following recommendations:
- Maximize batch size within hardware constraints; set the MC sample size to $S = 1$ (or as small as possible without incurring high MC variance).
- Use random reshuffling over the dataset each epoch when feasible.
- Select moderate gradient clipping thresholds $C$ for DP-DSVI; small $C$ induces bias, large $C$ increases noise.
- In non-private DSVI, diminishing or adaptive step size schedules (e.g., Adam) ensure convergence to stationary points of the ELBO under standard stochastic approximation conditions.
- Monitor both MC and mini-batch variance sources to optimize the gradient estimator trade-offs (Kim et al., 2024, Titsias, 2015, Lalchand et al., 2022).
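The random-reshuffling recommendation can be sketched as a per-epoch batch iterator (the helper name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

def reshuffled_batches(n, batch_size):
    """Yield index batches via random reshuffling: draw one permutation per
    epoch, so every example appears exactly once before any repeats."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

# One epoch over 10 examples in batches of 3 (last batch is smaller):
batches = list(reshuffled_batches(10, 3))
```

Compared with independent subsampling, each epoch is an exact pass over the data, which removes with-replacement sampling noise within the epoch.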
7. Comparative Analysis with Related Algorithms
DSVI unifies and generalizes several previous stochastic inference tools:
- Score-function / REINFORCE: Handles arbitrary latent variables (discrete or continuous) but suffers from high variance; typically requires control variates.
- Reparameterization trick: Lower variance for reparameterizable continuous models; not directly applicable to discrete or non-location-scale families.
- LeGrad: As general as score-function, systematically achieves lower variance by exact local marginalization, and is trivially parallelizable. Does not require additional control variates and can be integrated as a drop-in replacement into autodiff toolkits (Titsias, 2015).
The table summarizes key distinctions among algorithms:
| Estimator | Model Family Support | Variance Control |
|---|---|---|
| Score-function (REINFORCE) | Discrete + continuous | High; needs control vars |
| Reparameterization | Reparameterizable continuous | Low; limited generality |
| Local Expectation Gradients (LeGrad) | Discrete + continuous | Systematically low |
Systematic variance reduction, compatibility with GPUs and large-scale autodiff environments, and applicability to nonconjugate, high-dimensional latent structures distinguish DSVI as the foundation for modern scalable Bayesian machine learning (Titsias, 2015, Lalchand et al., 2022, Salimbeni et al., 2017, Boustati et al., 2020, Kim et al., 2024, Jälkö et al., 2016).