
Biased Stochastic Approximation Schemes

Updated 23 January 2026
  • Biased stochastic approximation schemes are iterative methods whose update rules rely on perturbed estimators carrying persistent, non-diminishing bias, enabling them to handle noise in complex, high-dimensional settings.
  • The methodology employs techniques like tail-averaging and Richardson–Romberg extrapolation to balance bias–variance trade-offs and achieve improved convergence rates.
  • These schemes are pivotal in fields such as deep learning, reinforcement learning, and multi-agent systems, supported by rigorous stationary distribution and non-asymptotic convergence analyses.

Biased stochastic approximation (SA) schemes constitute a class of iterative methods wherein the update rule incorporates perturbations or estimators with persistent, nonzero bias. Distinct from classical SA, which traditionally assumes unbiased noise and diminishing stepsizes to ensure consistency, modern approaches frequently tolerate, or even exploit, bias in order to enhance stability, reduce variance, or accelerate convergence—especially under Markovian observation models, adaptive or constant stepsizes, nonconvex objectives, or in high-dimensional and complex data environments. This paradigm is central to algorithms used in deep learning, reinforcement learning, adaptive Monte Carlo, distributed multi-agent systems, and manifold-based optimization. Recent theoretical work rigorously characterizes the stationary bias and variance of iterates, the geometric ergodicity of induced Markov chains, and provides explicit extrapolation and averaging strategies for bias mitigation.

1. Formal Setup and Types of Bias in Stochastic Approximation

A generic biased stochastic approximation update is of the form

x_{n+1} = x_n - a_n (F(x_n) + b_n + \xi_{n+1}),

where $a_n$ is a (possibly constant) stepsize, $F(x_n)$ is the drift (often taken as a mean-field or gradient), $b_n$ is a deterministic or state-dependent bias bounded by $\epsilon$, and $\xi_{n+1}$ is a martingale difference (with $\mathbb{E}[\xi_{n+1} \mid \mathcal{F}_n] = 0$ and bounded conditional variance). The bias may originate from Monte Carlo gradient estimation, iterate-dependent function approximation, Markovian sampling, truncation, asynchronous updates, or intentional design for variance reduction.
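The recursion above is easy to probe numerically. The following minimal sketch uses an illustrative scalar drift $F(x) = x - x^*$ (so $x^*$ is the unique root), a worst-case persistent bias $b_n = \epsilon$, and Gaussian martingale noise; all constants are assumptions for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

x_star, eps, a = 2.0, 0.05, 0.1      # root, bias bound, constant stepsize
x, acc, burn_in, n_steps = 0.0, 0.0, 1000, 20000

for n in range(n_steps):
    drift = x - x_star               # F(x_n)
    xi = rng.normal(scale=0.5)       # martingale-difference noise
    x -= a * (drift + eps + xi)      # biased SA update
    if n >= burn_in:
        acc += x

x_bar = acc / (n_steps - burn_in)
# The iterates settle around the perturbed root x_star - eps, i.e. within
# an O(eps) neighborhood of x_star rather than at x_star itself.
print(f"tail average: {x_bar:.3f} (target {x_star}, perturbed root {x_star - eps})")
```

The tail average lands near the perturbed root $x^* - \epsilon$, illustrating how a persistent bias shifts the stationary behavior away from the true root by an amount of order $\epsilon$.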

Bias structures can be classified as:

  • Persistent/Non-diminishing bias: $b_n$ remains bounded away from zero, e.g., from finite-step Markov sampling or non-vanishing approximation error (Paul et al., 16 Jan 2026).
  • Time-dependent or diminishing bias: $r_n \to 0$ as in adaptive Monte Carlo or increasing batch sizes (Surendran et al., 2024).
  • Operator-induced bias: As encountered in multi-agent asynchronous SA with delay and function approximation, or set-valued drift inclusions (Ramaswamy et al., 2018).
  • Noise-model bias: Markov chain state-dependent drift and observation models imparting nonzero expectation error for each update (Karimi et al., 2019, Lauand et al., 2023).

Specialized settings include Riemannian manifold recursions with biased mean-field estimation (Durmus et al., 2020, Durmus et al., 2021); two-timescale linear stochastic approximation under Markovian noise (Kwon et al., 2024); and constant-step linear SA for temporal difference learning and Q-learning (Huo et al., 2022).

2. Stationary Distribution, Bias–Variance Scaling, and Convergence Theory

Under weak regularity, including geometric ergodicity and Hurwitz conditions for the mean-field operator, constant-step biased SA schemes give rise to a unique stationary distribution for the iterates.

Key results include:

  • Convergence to Stationarity: The iterates $(x_t, y_t, \xi_t)$ under a constant stepsize regime converge weakly and exponentially to a unique joint stationary law in Wasserstein distance. The rate is explicit: $\bar W_2(\mathcal{L}(x_t, y_t, \xi_t), \mu) \leq C \exp(-c\,\alpha t)$ with constants computable in terms of system mixing time and drift contraction rates (Kwon et al., 2024, Huo et al., 2022).
  • Bias and Variance Characterization: The stationary mean error ("bias") is linear in the stepsizes,

\mathbb{E}[x_\infty - x^*] = \alpha\,\bar b_1^x + \beta\,\bar b_2^x + O(\beta^2)

\mathbb{E}[y_\infty - y^*(x_\infty)] = \alpha\,\bar b_1^y + \beta\,\bar b_2^y + O(\beta^2)

while variances scale with the corresponding stepsize: $\mathrm{Tr}(\mathrm{Var}(x_\infty)) = O(\alpha)$, $\mathrm{Tr}(\mathrm{Var}(y_\infty)) = O(\beta)$ (Kwon et al., 2024, Allmeier et al., 2024, Durmus et al., 2021).

  • Bias Expansion and Power Series: For linear SA under Markovian data, the stationary bias admits a power-series expansion in $\alpha$ to any fixed order $m$: $\mathbb{E}[\theta_\infty^{(\alpha)}] - \theta^* = \sum_{i=1}^m \alpha^i B^{(i)} + O(\alpha^{m+1})$ with explicit operators $B^{(i)}$ computable via Poisson equation solutions (Huo et al., 2022, Lauand et al., 2023).
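The leading-order term of this expansion can be seen numerically. The toy model below (not from the cited papers) runs constant-step linear SA driven by a two-state Markov chain with state-dependent curvature $A(X)$ and offset $b(X)$; to leading order the stationary bias is linear in $\alpha$, so doubling $\alpha$ should roughly double the bias.

```python
import numpy as np

rng = np.random.default_rng(1)

P_STAY = 0.9                      # holding probability of the two-state chain
A = np.array([1.5, 0.5])          # state-dependent curvature A(X)
b = np.array([1.0, 0.0])          # state-dependent offset b(X)
theta_star = b.mean() / A.mean()  # = 0.5 under the uniform stationary law

def stationary_mean(alpha, n_steps=1_000_000, burn_in=10_000):
    """Long-run time average of theta as a proxy for E[theta_infinity]."""
    flips = rng.random(n_steps) > P_STAY
    states = np.concatenate(([0], np.cumsum(flips[:-1]) % 2))
    a_path, b_path = A[states].tolist(), b[states].tolist()
    theta, acc = 0.0, 0.0
    for n in range(n_steps):
        theta -= alpha * (a_path[n] * theta - b_path[n])
        if n >= burn_in:
            acc += theta
    return acc / (n_steps - burn_in)

bias_small = stationary_mean(0.05) - theta_star
bias_big = stationary_mean(0.10) - theta_star
print(f"bias ratio when alpha doubles: {bias_big / bias_small:.2f}")
```

The measured ratio hovers near 2, as predicted by the $\alpha B^{(1)}$ leading term; the deviation from exactly 2 reflects the $O(\alpha^2)$ correction.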

3. Tail-Averaging, Extrapolation, and Bias Reduction Strategies

Simple post-processing techniques, notably Polyak–Ruppert tail-averaging and Richardson–Romberg extrapolation, permit sharp bias–variance trade-offs without requiring diminishing stepsizes.

  • Tail-Averaging (Polyak–Ruppert): After burn-in, averaging the SA iterates produces consistent estimators with variance $O(1/t)$, but leaves the stepsize-order bias intact in the constant-step regime: $\mathbb{E}[\tilde x_t - x^*]^2 = O(\beta^2) + O(1/(t - t_0))$ (Kwon et al., 2024, Huo et al., 2022).
  • Richardson–Romberg Extrapolation: Parallel runs at multiple stepsizes allow linear (or higher-order) cancellation of bias terms: $\zeta_t^x = 2\,\tilde x_t^{(\alpha,\beta)} - \tilde x_t^{(2\alpha,2\beta)}$ yields

\mathbb{E}[\zeta_t^x - x^*]^2 = O(\beta^4) + O(1/(t - t_0))

(Kwon et al., 2024, Huo et al., 2022, Allmeier et al., 2024).

  • Bias–Mixing-Time Proportionality: In reversible Markov chains, the leading bias is directly proportional to the mixing time and stepsize; under i.i.d. data the bias vanishes (Huo et al., 2022, Lauand et al., 2023).

4. Non-Asymptotic Analysis and Adaptive Algorithms with Biased Estimators

Adaptive stochastic gradient algorithms (AdaGrad, RMSProp, AMSGrad) retain their non-asymptotic convergence rates in the presence of time-dependent bias, given proper control via batch size or estimator parameterization.

  • General ASA Template:

\theta_{n+1} = \theta_n - \gamma_{n+1} A_n H_{\theta_n}(X_{n+1})

where $H_{\theta_n}(X_{n+1})$ is a biased gradient estimator (Surendran et al., 2024).

  • Rate Results: In convex and nonconvex settings, with batch size $N_n \sim n^\alpha$, bias $r_n \sim n^{-\alpha}$, and stepsizes $\gamma_n \sim n^{-1/2}$,

\mathbb{E}\left[\|\nabla V(\theta_R)\|^2\right] = O\left(\frac{\log n}{\sqrt{n}} + b_n\right)

where $b_n$ is governed by the bias decay; for $\alpha \geq 1/4$, the optimal $O(\log n / \sqrt{n})$ rate is achieved (Surendran et al., 2024).

  • Practical Guidance: For importance-weighted autoencoding and Monte Carlo-based estimators, bias control via $N_n \sim n^{1/4}$ achieves near-unbiased rates, with empirical confirmation in deep generative modeling (Surendran et al., 2024).
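The growing-batch recipe can be sketched on a toy quadratic $V(\theta) = \theta^2/2$ (so $\nabla V(\theta) = \theta$). The gradient oracle below carries an artificial bias $r_n \sim n^{-1/4}$, mimicking a Monte Carlo estimator whose bias shrinks with the batch size $N_n \sim n^{1/4}$; all constants are illustrative assumptions, not prescriptions from the cited work.

```python
import numpy as np

rng = np.random.default_rng(3)

theta = 5.0
for n in range(1, 3001):
    gamma = 0.5 / np.sqrt(n)            # stepsize gamma_n ~ n^{-1/2}
    n_batch = max(1, int(n ** 0.25))    # growing batch N_n ~ n^{1/4}
    bias = 0.5 / n_batch                # estimator bias r_n ~ n^{-1/4}
    noise = rng.normal(size=n_batch).mean()
    grad_est = theta + bias + noise     # biased estimate of grad V(theta)
    theta -= gamma * grad_est

# |grad V(theta)| = |theta| settles at the bias-controlled floor O(r_n).
print(f"final gradient norm: {abs(theta):.3f}")
```

The iterate descends quickly from its starting point and then hovers at a floor set by the residual estimator bias, consistent with the $O(\log n/\sqrt{n} + b_n)$ rate above.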

5. Differential Inclusion, Input-to-State Stability, and Multi-Agent/Distributed Settings

The presence of persistent bias changes the limiting dynamics from ODEs to perturbed differential inclusions,

\dot{x}(t) \in F(x(t)) + \bar{B}(0, \epsilon)

where $\bar{B}(0, \epsilon)$ denotes the bias perturbation set.

  • Input-to-State Stability (ISS): If the perturbed DI is ISS, and discrete iterates remain bounded, then convergence to an $O(\epsilon)$ neighborhood of equilibrium is guaranteed. ISS Lyapunov conditions, globally Lipschitz drift, and quadratic potential bounds yield a.s. boundedness and stability (Paul et al., 16 Jan 2026, Ramaswamy et al., 2018).
  • Asynchronous Approximate Dynamics: In distributed and asynchronous multi-agent SA (e.g., A2VI, A2PG), stability is retained under bounded bias. Convergence is to a neighborhood of the solution set, with diameter proportional to the bias magnitude (Ramaswamy et al., 2018).
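A scalar toy check (linear contractive drift, worst-case constant bias; all constants are assumptions) makes the ISS-style guarantee concrete: under a persistent bias of magnitude $\epsilon$, the iterates settle into an $O(\epsilon)$ neighborhood of the equilibrium $x^* = 0$, and shrinking $\epsilon$ shrinks that neighborhood proportionally.

```python
import numpy as np

rng = np.random.default_rng(4)

def neighborhood_radius(eps, n_steps=40_000, burn_in=5_000, a=0.05):
    """Distance of the tail-averaged iterate from the unbiased equilibrium."""
    noise = rng.normal(scale=0.1, size=n_steps).tolist()
    x, acc = 1.0, 0.0
    for n in range(n_steps):
        x -= a * (x + eps + noise[n])   # drift F(x) = x, persistent bias eps
        if n >= burn_in:
            acc += x
    return abs(acc / (n_steps - burn_in))

r_big = neighborhood_radius(0.1)
r_small = neighborhood_radius(0.01)
print(f"radius at eps=0.1: {r_big:.3f}, at eps=0.01: {r_small:.3f}")
```

The limiting radius tracks the bias magnitude roughly linearly, matching the picture of convergence to a neighborhood whose diameter is proportional to the bias.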

6. Riemannian Manifold Stochastic Approximation with Bias

Extension to Riemannian SA schemes allows general geometric optimization under bias:

  • Fixed-Step Markov Chain Structure: With constant stepsize $\gamma$, the chain $\{\theta_n\}$ admits a unique stationary law that concentrates near the optimum as $\gamma \to 0$, with bias $O(\gamma)$ (Durmus et al., 2021, Durmus et al., 2020).
  • Lyapunov and CLT Theory: The invariant measure bias has leading term

-\int_\Theta \langle \mathrm{grad}\,g(\theta), h(\theta)\rangle\,\mu^\gamma(d\theta) = \frac{\gamma}{2}\,[\mathrm{Hess}\,g : \Sigma](\theta^*) + o(\gamma)

The variance scales as $O(\gamma)$; a central limit theorem quantifies fluctuations around the biased mean (Durmus et al., 2021).
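A minimal constant-step sketch on the unit sphere illustrates the fixed-step behavior (projection retraction and a linear objective are illustrative choices, not the exact scheme of the cited papers): minimize $g(\theta) = -\langle\theta, v\rangle$ over $\|\theta\| = 1$ with noisy Riemannian gradients, and observe the stationary law concentrating near the optimum as $\gamma$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(5)
v = np.array([1.0, 0.0, 0.0])        # optimum theta* = v

def tail_avg_error(gamma, n_steps=20_000, burn_in=5_000):
    """Average distance of the iterate from the optimum in near-stationarity."""
    theta = np.array([0.0, 1.0, 0.0])
    acc = 0.0
    for n in range(n_steps):
        egrad = -v + rng.normal(scale=0.3, size=3)  # noisy Euclidean gradient
        rgrad = egrad - theta * (theta @ egrad)     # project onto tangent space
        theta = theta - gamma * rgrad               # Riemannian SA step
        theta /= np.linalg.norm(theta)              # retraction back to sphere
        if n >= burn_in:
            acc += np.linalg.norm(theta - v)
    return acc / (n_steps - burn_in)

e_big = tail_avg_error(0.1)
e_small = tail_avg_error(0.01)
print(f"stationary spread: gamma=0.1 -> {e_big:.3f}, gamma=0.01 -> {e_small:.3f}")
```

The stationary spread around the optimum tightens as $\gamma \to 0$, in line with the $O(\gamma)$ bias and variance scalings above.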

7. Practical Applications, Empirical Validation, and Bias-Mitigation Strategies


Taken together, contemporary research provides precise bias–variance analysis, non-asymptotic convergence theory, and robust bias mitigation approaches for a broad class of biased stochastic approximation schemes. These frameworks inform practical algorithm design in deep learning, RL, simulation-based inference, and geometric optimization, offering mathematically grounded recipes for balancing stepsize, bias control, variance reduction, and scaling to high-dimensional or nonconvex objectives.
