Unbiased MLMC Gradient Estimators
- The paper introduces telescoping-sum decompositions to construct unbiased gradient estimates with controlled variance for nested expectation problems.
- It details algorithmic strategies using antithetic coupling and randomized level selection to balance cost efficiency and variance reduction.
- The analysis demonstrates optimal complexity bounds and showcases applications in variational inference, Bayesian design, and kinetic Langevin sampling.
Unbiased multilevel Monte Carlo (MLMC) gradient estimators are a class of stochastic estimators that use hierarchy-based variance reduction, telescoping-sum decompositions, and level randomization to achieve unbiasedness with finite variance and computational cost, even for nested expectation or intractable objective settings. MLMC estimators are widely used for variational inference, stochastic nested optimization, evidence gradients, Bayesian experimental design, simulation-based inference, and kinetic Langevin sampling with inexact gradients. This entry provides a comprehensive view of the foundational principles, mathematical constructions, algorithmic realizations, complexity bounds, representative applications, and practical limitations grounded in current arXiv literature.
1. Mathematical Foundations of MLMC Gradient Estimation
The unbiased MLMC gradient paradigm exploits telescoping decompositions for expectations or their gradients across a hierarchy of Monte Carlo sample sizes or step discretizations. For a target $\nabla_\theta F$ approximated by an increasing sequence of Monte Carlo proxies $G_0, G_1, \dots$ (e.g., indexed by inner sample size for nested problems), the telescoping sum

$$\nabla_\theta F = \lim_{\ell \to \infty} \mathbb{E}[G_\ell] = \sum_{\ell=0}^{\infty} \mathbb{E}[G_\ell - G_{\ell-1}], \qquad G_{-1} := 0,$$

expresses the target gradient as the sum of levelwise increments. Under suitable coupling (e.g., antithetic splits of samples), the construction preserves unbiasedness at each level and for the overall estimator (Goda et al., 2022, Ishikawa et al., 2020).
The single-term Rhee–Glynn estimator further introduces randomization for unbiasedness. With probability mass $w_\ell > 0$ ($\sum_{\ell \ge 0} w_\ell = 1$), one samples a random level $L$ with $\mathbb{P}(L = \ell) = w_\ell$ and constructs

$$Z = \frac{\Delta G_L}{w_L}, \qquad \Delta G_\ell = G_\ell - G_{\ell-1},$$

to ensure $\mathbb{E}[Z] = \sum_{\ell} \mathbb{E}[\Delta G_\ell] = \nabla_\theta F$. Cost-efficiency and variance control arise from the geometric decay of $\mathrm{Var}(\Delta G_\ell)$ in $\ell$, permitting overall finite moments when $w_\ell$ decays accordingly (Goda et al., 2020, Yang et al., 2024, Asi et al., 2021).
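As a concrete illustration (a toy sketch, not taken from the cited papers), the following snippet debiases the nested quantity $f(\mathbb{E}[Y]) = (\mathbb{E}[Y])^2$, which a fixed-inner-sample nested estimator would estimate with $O(1/N)$ bias, using antithetic level increments and single-term level randomization; the decay rate $1.5$ and the 20-level truncation are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 1.0  # target: f(E[Y]) = mu**2 for f(m) = m**2, Y ~ N(mu, sigma**2)

def increment(level):
    """Antithetic level increment for the proxy G_l = f(mean of 2^l samples)."""
    n = 2 ** level
    y = rng.normal(mu, sigma, size=n)
    if level == 0:
        return y.mean() ** 2                        # base term G_0
    a, b = y[: n // 2].mean(), y[n // 2:].mean()    # antithetic half-sample means
    return ((a + b) / 2) ** 2 - (a ** 2 + b ** 2) / 2

# single-term randomization: P(L = l) = w_l, geometric with rate 2^(-1.5 l)
w = 2.0 ** (-1.5 * np.arange(20))
w /= w.sum()

def single_term():
    level = rng.choice(len(w), p=w)
    return increment(level) / w[level]              # importance-weighted increment

est = np.mean([single_term() for _ in range(100_000)])
print(est)  # close to mu**2 = 2.25; a fixed-N nested estimator overshoots by sigma**2/N
```

Averaging many single-term draws recovers the exact value $\mu^2$, while any fixed inner sample size leaves a systematic bias.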
For MLMC in discretization-gradient settings, e.g., kinetic Langevin sampling, the increments $\Delta G_\ell$ are differences between coupled chains run at neighboring step sizes ($h_\ell$ vs. $h_{\ell-1} = 2 h_\ell$), and the estimator is an average of level differences weighted by their randomization probabilities (Chada et al., 2023).
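A minimal sketch of the discretization coupling (using a simple Ornstein–Uhlenbeck SDE rather than the kinetic Langevin dynamics of Chada et al., purely for brevity) shows the key device: the fine and coarse Euler chains share Brownian increments, so the level difference of a test statistic has rapidly decaying variance:

```python
import numpy as np

rng = np.random.default_rng(1)

def coupled_increment(level, T=1.0, x0=1.0):
    """Level increment f(X_T^fine) - f(X_T^coarse) with f(x) = x**2, for Euler
    chains on dX = -X dt + sqrt(2) dW at step sizes T/2**(level+1) vs. T/2**level,
    driven by shared Brownian increments."""
    n_fine = 2 ** (level + 1)
    h = T / n_fine
    dW = rng.normal(0.0, np.sqrt(h), size=n_fine)
    x_f, x_c = x0, x0
    for k in range(0, n_fine, 2):
        x_f += -x_f * h + np.sqrt(2.0) * dW[k]      # two fine steps ...
        x_f += -x_f * h + np.sqrt(2.0) * dW[k + 1]
        x_c += -x_c * 2 * h + np.sqrt(2.0) * (dW[k] + dW[k + 1])  # ... one coarse step
    return x_f ** 2 - x_c ** 2

v1 = np.var([coupled_increment(1) for _ in range(20_000)])
v3 = np.var([coupled_increment(3) for _ in range(20_000)])
print(v1, v3)  # the increment variance shrinks geometrically as the level grows
```

Without the shared `dW`, the two chains would be independent and the increment variance would not decay, breaking the MLMC cost bounds.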
2. Class of Target Problems and Expressivity
MLMC gradient estimators target expectation gradients in settings where naive MC or fixed-sample nested MC estimators are biased or computationally infeasible. These encompass:
- Nested expectation optimization: objectives of the form $\min_\theta \mathbb{E}_X\big[f\big(\mathbb{E}_{Y|X}[g(X, Y; \theta)]\big)\big]$, appearing in stochastic compositional or Bayesian design objectives (Goda et al., 2022, Goda et al., 2020).
- Variational inference with reparameterized gradients: ELBO or KL objectives with reparameterizable stochastic nodes (Fujisawa et al., 2019, He et al., 2021, Ishikawa et al., 2020).
- Likelihood-free/posterior inference: Variational Bayes or simulation-based inference where log-likelihood (or its normalizing constant) is itself an expectation (He et al., 2021, Yang et al., 2024).
- Smoothing and proximal mappings: Moreau–Yoshida envelope gradients, dimension-free randomized smoothing, and non-smooth stochastic optimization (Asi et al., 2021).
- Gradient flows and SDE sampling: Kinetic Langevin–type discretizations for posterior mean estimation (Chada et al., 2023).
The MLMC estimators are structurally compatible with both score-function and reparameterization/automatic-differentiation-based approaches, provided hierarchical couplings for unbiased differences are constructed at each level.
3. Construction and Algorithmic Details
A typical unbiased MLMC estimator involves:
- Level-wise sample allocation: For each level $\ell = 0, 1, \dots$, with sample size $M_\ell = M_0 2^\ell$, form a Monte Carlo proxy $G_\ell$ for the inner/nested/stochastic expectation.
- Antithetic coupling: Splitting the $M_\ell$ samples into two halves permits construction of two half-sized proxies $G_\ell^{(a)}, G_\ell^{(b)}$, yielding level increments $\Delta G_\ell = G_\ell - \tfrac{1}{2}\big(G_\ell^{(a)} + G_\ell^{(b)}\big)$. This coupling reduces the variance of the increment, which decays geometrically (e.g., as $O(2^{-2\ell})$) for smooth objectives and moments bounded away from zero (Goda et al., 2022, Ishikawa et al., 2020).
- Randomization over levels: Using probabilities $w_\ell \propto 2^{-\nu\ell}$, with $\sum_\ell w_\ell = 1$, the single-term estimator $\Delta G_L / w_L$ is unbiased with finite variance and finite expected cost for a suitable decay rate $\nu$ (Goda et al., 2020, Chada et al., 2023).
- Stochastic optimization integration: The unbiased estimator (or batched average) can be fed into any SGD rule; all guarantees for convex or strongly convex optimization hold, as per classical stochastic approximation (Goda et al., 2020, Goda et al., 2022).
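The components above assemble into a drop-in stochastic gradient. The sketch below (a hypothetical toy objective, not from the cited works) minimizes $F(\theta) = (\theta\,\mathbb{E}[Y] - c)^2$, whose naive nested gradient $2(\theta \bar{Y}_N - c)\bar{Y}_N$ carries a $2\theta\sigma^2/N$ bias, by feeding single-term MLMC gradients into plain SGD:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, c = 2.0, 1.0, 3.0  # toy objective F(t) = (t*mu - c)**2, minimizer c/mu = 1.5

def grad_increment(theta, level):
    """Antithetic MLMC increment of dF/dtheta at the given level."""
    n = 2 ** level
    y = rng.normal(mu, sigma, size=n)
    def d(m):                                   # naive plug-in gradient at inner mean m
        return 2.0 * (theta * m - c) * m
    if level == 0:
        return d(y.mean())
    a, b = y[: n // 2].mean(), y[n // 2:].mean()
    return d((a + b) / 2) - 0.5 * (d(a) + d(b))

w = 2.0 ** (-1.5 * np.arange(16))
w /= w.sum()

theta = 0.0
for t in range(1, 5001):
    level = rng.choice(len(w), p=w)
    g = grad_increment(theta, level) / w[level]  # unbiased gradient estimate
    theta -= 0.02 / np.sqrt(t) * g               # Robbins-Monro step sizes
print(theta)  # approaches the true minimizer c/mu = 1.5
```

Note that the linear part of the plug-in gradient cancels exactly in the antithetic increment, so only the bias-carrying quadratic term contributes at levels $\ell \ge 1$.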
Numerous algorithmic variants exist:
- Hybrid or Truncated MLMC computes all increments up to a base level deterministically, randomizes over the upper levels, and may truncate the maximal level to trade a small residual bias for further variance reduction (Yang et al., 2024).
- Adaptive sample allocation via pilot runs or theoretical variance/cost balancing (e.g., $N_\ell \propto \sqrt{V_\ell / C_\ell}$, where $V_\ell$, $C_\ell$ are the variance and cost at level $\ell$, respectively) achieves asymptotically optimal use of total computation (Fujisawa et al., 2019).
- RQMC inner loops can be integrated at each level to accelerate convergence rates, especially when coupled with smooth integrands and suitable point set constructions (He et al., 2021).
- Special-case estimators (e.g., for squared-loss objectives) can exploit analytical variance ordering or antithetic pairing that guarantee lower variance (Goda et al., 2022).
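For the adaptive-allocation variant, the classical variance/cost balancing can be sketched in a few lines (the pilot numbers `V` and `C` below are illustrative, not taken from the cited papers):

```python
import numpy as np

# illustrative pilot estimates of increment variance V_l and unit cost C_l per level
V = np.array([1.0, 0.25, 0.06, 0.015])
C = np.array([1.0, 2.0, 4.0, 8.0])
eps = 0.05                      # target root-mean-square accuracy

# N_l proportional to sqrt(V_l / C_l), scaled so the estimator variance
# stays within half the eps**2 error budget (the rest is left for bias)
N = np.ceil((2.0 / eps ** 2) * np.sqrt(V / C) * np.sum(np.sqrt(V * C))).astype(int)
print(N, (V / N).sum())         # achieved variance is at most eps**2 / 2
```

In hybrid/truncated schemes this sets the deterministic per-level sample counts; in fully randomized schemes the same $\sqrt{V_\ell / C_\ell}$ balance motivates the choice of level probabilities.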
4. Variance, Cost, and Complexity Properties
For all unbiased MLMC gradient estimators under suitable regularity (increment variance decay $V_\ell = O(2^{-\beta\ell})$ with exponent $\beta$; cost per increment $C_\ell = O(2^{\gamma\ell})$ with $\gamma < \beta$):
- Variance bound: $\mathrm{Var}(Z) \le \sum_\ell V_\ell / w_\ell < \infty$ whenever the level probabilities $w_\ell$ decay more slowly than $V_\ell$,
- Cost bound: Average work per estimator is $\sum_\ell w_\ell C_\ell < \infty$ (finite for proper choice of $w_\ell$, e.g., $w_\ell \propto 2^{-\nu\ell}$ with $\gamma < \nu < \beta$),
- MLMC complexity theorem: To achieve MSE $\varepsilon^2$, total cost is $O(\varepsilon^{-2})$, which is optimal and superior to the $O(\varepsilon^{-3})$ cost of single-level nested estimators (which must balance bias and variance via an inner sample size $N = O(\varepsilon^{-1})$) (Ishikawa et al., 2020, He et al., 2021).
- Gradient norm convergence: In optimization, the standard stochastic-approximation rates are retained under the usual step-size schedules for convex (even strongly convex) objectives (Goda et al., 2020, Goda et al., 2022).
- Signal-to-noise ratio (SNR): For MLMC-VI, the SNR of the gradient estimator is controlled levelwise, and convergence is improved by sample sizing and step scheduling (Fujisawa et al., 2019).
These results are robust to the selection of randomization strategies (single-term, Russian roulette, or generalized variants), with practical initialization parameters (e.g., truncation point, base sample size) set via pilot variance estimates or inefficiency minimization (Yang et al., 2024).
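The interplay of the three rates can be checked numerically: with $V_\ell \propto 2^{-\beta\ell}$, $C_\ell \propto 2^{\gamma\ell}$, and $w_\ell \propto 2^{-\nu\ell}$, both the variance sum and the expected cost converge precisely when $\gamma < \nu < \beta$ (the values below are illustrative):

```python
import numpy as np

beta, gamma, nu = 2.0, 1.0, 1.5   # variance decay, cost growth, level-probability decay
l = np.arange(60)
V = 2.0 ** (-beta * l)            # V_l = O(2^{-beta l})
C = 2.0 ** (gamma * l)            # C_l = O(2^{gamma l})
w = 2.0 ** (-nu * l)
w /= w.sum()

var_sum = (V / w).sum()           # variance bound: sum_l V_l / w_l
cost = (w * C).sum()              # expected cost:  sum_l w_l C_l
print(var_sum, cost)              # both series converge since gamma < nu < beta
```

Setting `nu` outside the open interval $(\gamma, \beta)$ makes one of the two sums diverge, which is exactly the trade-off the randomization probabilities must negotiate.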
5. Selected Applications and Empirical Behavior
MLMC gradient estimators are deployed across a range of inference and optimization problems:
- Variational Inference (MLMC-VI): Demonstrated on hierarchical linear regression, Bayesian logistic regression, and Bayesian neural network regression. MLMC-VI achieves accelerated convergence, lower gradient variance, and sample size reduction compared to MC and RQMC baselines (Fujisawa et al., 2019).
- Nested Expectation Optimization: In Bayesian experimental design, MLMC-SGD outperforms standard MC by converging to correct optima, uses fewer inner samples per gradient, and reduces bias to negligible levels (Goda et al., 2020). For simulation-based posterior inference, MLMC estimators debias the nested log-normalizer gradient in neural posterior estimation, yielding robust convergence guarantees (Yang et al., 2024).
- Variational Bayes with Intractable Likelihoods: MLMC-based estimators for the gradient of ELBO eliminate bias present in previous VBIL methods, recover or improve posterior approximations, and benefit from inner RQMC acceleration, yielding higher ELBOs and improved posterior fit in ABC and GLMM tasks (He et al., 2021).
- Evidence Gradient Estimation: For Bayesian latent-variable models, unbiased MLMC gradients for the log-evidence can be obtained using coupled importance-sampling batches, leading to an $O(\varepsilon^{-2})$ complexity to reach target accuracy $\varepsilon$—one order of magnitude better in $\varepsilon$ than previous debiasing schemes (Ishikawa et al., 2020).
- Proximal point and Smoothing: For the Moreau–Yoshida envelope of non-smooth objectives, unbiased MLMC gradients built from ODC subroutines yield dimension-free variance bounds at low expected cost per estimator, with near-optimal total cost for a target MSE, supporting efficient randomized smoothing and projection-efficient optimization (Asi et al., 2021).
- Kinetic Langevin Dynamics: In Bayesian kinetic Langevin sampling with inexact gradients, MLMC estimators for the equilibrium expectation achieve unbiasedness, finite variance, and finite expected gradient cost, with effectiveness demonstrated in high-dimensional applications including MNIST multinomial regression (Chada et al., 2023).
Empirical findings across studies confirm faster optimization, substantially reduced estimator variance, and, crucially, the elimination of bias that plagues fixed-sample nested MC approaches.
6. Practical Implementation and Algorithmic Variants
Key algorithmic components for implementation include:
- Antithetic sample splitting: Essential for strong coupling at each MLMC level, dramatically reducing the variance of level increments (He et al., 2021, Yang et al., 2024).
- Level randomization strategies: Single-term (RU-MLMC), Russian roulette (GRR-MLMC), and truncated randomization (TGRR-MLMC) methods provide different bias–variance–cost trade-offs; TGRR-MLMC is empirically favored for balancing variance control and residual bias (Yang et al., 2024).
- Adaptive sample-size scheduling (e.g., Algorithm 1 in (Fujisawa et al., 2019)) automatically decreases work per iteration as optimization progresses, supported by theoretical analyses of variance and SNR.
- Control variates: In score-function estimators and ELBO settings, control variates can be integrated for additional variance reduction without affecting unbiasedness (He et al., 2021).
- Coupled stochastic gradients: In large-scale SDEs or kinetic Langevin, coupling subsampled gradients across time-steps and levels is critical for unbiased level increments and complexity bounds (Chada et al., 2023).
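The randomization strategies trade off differently in practice; a minimal sketch of the Russian-roulette (sum-based) variant, applied to the toy nested quantity $(\mathbb{E}[Y])^2$ with an illustrative continuation probability `p` (not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 1.5   # toy target: (E[Y])**2 = 2.25 for Y ~ N(mu, 1)

def increment(level):
    """Antithetic level increment for the proxy G_l = (mean of 2^l samples)**2."""
    n = 2 ** level
    y = rng.normal(mu, 1.0, size=n)
    if level == 0:
        return y.mean() ** 2
    a, b = y[: n // 2].mean(), y[n // 2:].mean()
    return ((a + b) / 2) ** 2 - (a ** 2 + b ** 2) / 2

def russian_roulette(p=0.35, max_level=25):
    """Sum increments, each reweighted by its survival probability p**level,
    until a coin flip terminates (GRR-style randomization)."""
    total, survive, level = 0.0, 1.0, 0
    while level <= max_level:
        total += increment(level) / survive
        if rng.random() >= p:        # stop with probability 1 - p
            break
        level += 1
        survive *= p                 # P(reaching this level) = p**level
    return total

est = np.mean([russian_roulette() for _ in range(100_000)])
print(est)  # close to mu**2 = 2.25
```

Here finite variance requires $p > 1/4$ (the increment variance decays like $4^{-\ell}$) while finite expected cost requires $p < 1/2$ (per-level cost grows like $2^\ell$), mirroring the rate window discussed in Section 4.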
Pseudocode templates for each setting are available and detailed in the cited works.
7. Limitations and Open Directions
Although unbiased MLMC gradient estimators are powerful and broadly applicable, their efficiency is limited in some high-variance or highly complex settings, wherein the raw unbiased estimator may still have excessive variance, requiring base-level computations or truncation (Yang et al., 2024). Selection of optimal truncation, geometric rates, and base levels is typically guided via pilot runs and asymptotic inefficiency analyses. In principle, the methodology is effective whenever the variance of level increments decays geometrically, which relies on suitable regularity and moment conditions for the underlying problem (Ishikawa et al., 2020, Chada et al., 2023).
Open directions include robustly integrating these estimators with adaptive batch sizes, automated hyperparameter selection, extension to higher-order stochastic optimization, and exploration of their properties in settings where coupling is challenging or the cost of simulating level increments grows superlinearly.
Key References
- (Fujisawa et al., 2019) Multilevel Monte Carlo Variational Inference
- (Goda et al., 2022) Constructing unbiased gradient estimators with finite variance for conditional stochastic optimization
- (Goda et al., 2020) Unbiased MLMC stochastic gradient-based optimization of Bayesian experimental designs
- (He et al., 2021) Unbiased MLMC-based variational Bayes for likelihood-free inference
- (Ishikawa et al., 2020) Efficient Debiased Evidence Estimation by Multilevel Monte Carlo Sampling
- (Yang et al., 2024) Leveraging Nested MLMC for Sequential Neural Posterior Estimation with Intractable Likelihoods
- (Asi et al., 2021) Stochastic Bias-Reduced Gradient Methods
- (Chada et al., 2023) Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients