
Biased Stochastic Approximation Schemes

Updated 23 January 2026
  • Biased stochastic approximation schemes are iterative methods whose update rules rely on perturbed estimators carrying persistent, non-diminishing bias, enabling them to handle noise in complex, high-dimensional settings.
  • The methodology employs techniques like tail-averaging and Richardson–Romberg extrapolation to balance bias–variance trade-offs and achieve improved convergence rates.
  • These schemes are pivotal in fields such as deep learning, reinforcement learning, and multi-agent systems, supported by rigorous stationary distribution and non-asymptotic convergence analyses.

Biased stochastic approximation (SA) schemes constitute a class of iterative methods wherein the update rule incorporates perturbations or estimators with persistent, nonzero bias. Distinct from classical SA, which traditionally assumes unbiased noise and diminishing stepsizes to ensure consistency, modern approaches frequently tolerate, or even exploit, bias in order to enhance stability, reduce variance, or accelerate convergence—especially under Markovian observation models, adaptive or constant stepsizes, nonconvex objectives, or in high-dimensional and complex data environments. This paradigm is central to algorithms used in deep learning, reinforcement learning, adaptive Monte Carlo, distributed multi-agent systems, and manifold-based optimization. Recent theoretical work rigorously characterizes the stationary bias and variance of iterates, the geometric ergodicity of induced Markov chains, and provides explicit extrapolation and averaging strategies for bias mitigation.

1. Formal Setup and Types of Bias in Stochastic Approximation

A generic biased stochastic approximation update is of the form

x_{n+1} = x_n - a_n (F(x_n) + b_n + \xi_{n+1}),

where $a_n$ is a (possibly constant) stepsize, $F(x_n)$ is the drift (often taken as a mean-field or gradient), $b_n$ is a deterministic or state-dependent bias bounded by $\epsilon$, and $\xi_{n+1}$ is a martingale difference (with $\mathbb{E}[\xi_{n+1} \mid \mathcal{F}_n] = 0$ and bounded conditional variance). The bias may originate from Monte Carlo gradient estimation, iterate-dependent function approximation, Markovian sampling, truncation, asynchronous updates, or intentional design for variance reduction.
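The recursion above is easy to probe numerically. The following minimal sketch uses an illustrative scalar drift $F(x) = x - x^*$ (so $x^*$ is the unique root), a worst-case persistent bias $b_n = \epsilon$, and Gaussian martingale noise; all constants are assumptions for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

x_star, eps, a = 2.0, 0.05, 0.1      # root, bias bound, constant stepsize
x, acc, burn_in, n_steps = 0.0, 0.0, 1000, 20000

for n in range(n_steps):
    drift = x - x_star               # F(x_n)
    xi = rng.normal(scale=0.5)       # martingale-difference noise
    x -= a * (drift + eps + xi)      # biased SA update
    if n >= burn_in:
        acc += x

x_bar = acc / (n_steps - burn_in)
# The iterates settle around the perturbed root x_star - eps, i.e. within
# an O(eps) neighborhood of x_star rather than at x_star itself.
print(f"tail average: {x_bar:.3f} (target {x_star}, perturbed root {x_star - eps})")
```

The tail average lands near the perturbed root $x^* - \epsilon$, illustrating how a persistent bias shifts the stationary behavior away from the true root by an amount of order $\epsilon$.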

Bias structures can be classified as:

  • Persistent/Non-diminishing bias: $b_n$ remains bounded away from zero, e.g., from finite-step Markov sampling or non-vanishing approximation error (Paul et al., 16 Jan 2026).
  • Time-dependent or diminishing bias: $r_n \to 0$ as in adaptive Monte Carlo or increasing batch sizes (Surendran et al., 2024).
  • Operator-induced bias: As encountered in multi-agent asynchronous SA with delay and function approximation, or set-valued drift inclusions (Ramaswamy et al., 2018).
  • Noise-model bias: Markov chain state-dependent drift and observation models imparting nonzero expectation error for each update (Karimi et al., 2019, Lauand et al., 2023).

Specialized settings include Riemannian manifold recursions with biased mean-field estimation (Durmus et al., 2020, Durmus et al., 2021); two-timescale linear stochastic approximation under Markovian noise (Kwon et al., 2024); and constant-step linear SA for temporal difference learning and Q-learning (Huo et al., 2022).

2. Stationary Distribution, Bias–Variance Scaling, and Convergence Theory

Under weak regularity, including geometric ergodicity and Hurwitz conditions for the mean-field operator, constant-step biased SA schemes give rise to a unique stationary distribution for the iterates.

Key results include:

  • Convergence to Stationarity: The iterates $(x_t, y_t, \xi_t)$ under a constant stepsize regime converge weakly and exponentially to a unique joint stationary law in Wasserstein distance. The rate is explicit: $\bar W_2(\mathcal{L}(x_t, y_t, \xi_t), \mu) \leq C \exp(-c\,\alpha t)$ with constants computable in terms of system mixing time and drift contraction rates (Kwon et al., 2024, Huo et al., 2022).
  • Bias and Variance Characterization: The stationary mean error ("bias") is linear in the stepsizes,

\mathbb{E}[x_\infty - x^*] = \alpha\,\bar b_1^x + \beta\,\bar b_2^x + O(\beta^2)

\mathbb{E}[y_\infty - y^*(x_\infty)] = \alpha\,\bar b_1^y + \beta\,\bar b_2^y + O(\beta^2)

while variances scale with the corresponding stepsize: $\mathrm{Tr}(\mathrm{Var}(x_\infty)) = O(\alpha)$, $\mathrm{Tr}(\mathrm{Var}(y_\infty)) = O(\beta)$ (Kwon et al., 2024, Allmeier et al., 2024, Durmus et al., 2021).

  • Bias Expansion and Power Series: For linear SA under Markovian data, the stationary bias admits a power-series expansion in $\alpha$ to any fixed order $m$: $\mathbb{E}[\theta_\infty^{(\alpha)}] - \theta^* = \sum_{i=1}^m \alpha^i B^{(i)} + O(\alpha^{m+1})$ with explicit operators $B^{(i)}$ computable via Poisson equation solutions (Huo et al., 2022, Lauand et al., 2023).
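The leading-order term of this expansion can be seen numerically. The toy model below (not from the cited papers) runs constant-step linear SA driven by a two-state Markov chain with state-dependent curvature $A(X)$ and offset $b(X)$; to leading order the stationary bias is linear in $\alpha$, so doubling $\alpha$ should roughly double the bias.

```python
import numpy as np

rng = np.random.default_rng(1)

P_STAY = 0.9                      # holding probability of the two-state chain
A = np.array([1.5, 0.5])          # state-dependent curvature A(X)
b = np.array([1.0, 0.0])          # state-dependent offset b(X)
theta_star = b.mean() / A.mean()  # = 0.5 under the uniform stationary law

def stationary_mean(alpha, n_steps=1_000_000, burn_in=10_000):
    """Long-run time average of theta as a proxy for E[theta_infinity]."""
    flips = rng.random(n_steps) > P_STAY
    states = np.concatenate(([0], np.cumsum(flips[:-1]) % 2))
    a_path, b_path = A[states].tolist(), b[states].tolist()
    theta, acc = 0.0, 0.0
    for n in range(n_steps):
        theta -= alpha * (a_path[n] * theta - b_path[n])
        if n >= burn_in:
            acc += theta
    return acc / (n_steps - burn_in)

bias_small = stationary_mean(0.05) - theta_star
bias_big = stationary_mean(0.10) - theta_star
print(f"bias ratio when alpha doubles: {bias_big / bias_small:.2f}")
```

The measured ratio hovers near 2, as predicted by the $\alpha B^{(1)}$ leading term; the deviation from exactly 2 reflects the $O(\alpha^2)$ correction.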

3. Tail-Averaging, Extrapolation, and Bias Reduction Strategies

Simple post-processing techniques, notably Polyak–Ruppert tail-averaging and Richardson–Romberg extrapolation, permit sharp bias–variance trade-offs without requiring diminishing stepsizes.

  • Tail-Averaging (Polyak–Ruppert): After burn-in, averaging the SA iterates produces consistent estimators with variance $O(1/t)$, but leaves the stepsize-order bias intact in the constant-step regime: $\mathbb{E}[\tilde x_t - x^*]^2 = O(\beta^2) + O(1/(t - t_0))$ (Kwon et al., 2024, Huo et al., 2022).
  • Richardson–Romberg Extrapolation: Parallel runs at multiple stepsizes allow linear (or higher-order) cancellation of bias terms: $\zeta_t^x = 2\,\tilde x_t^{(\alpha,\beta)} - \tilde x_t^{(2\alpha,2\beta)}$ yields

\mathbb{E}[\zeta_t^x - x^*]^2 = O(\beta^4) + O(1/(t - t_0))

(Kwon et al., 2024, Huo et al., 2022, Allmeier et al., 2024).

  • Bias–Mixing-Time Proportionality: In reversible Markov chains, the leading bias is directly proportional to the mixing time and stepsize; under i.i.d. data the bias vanishes (Huo et al., 2022, Lauand et al., 2023).

4. Non-Asymptotic Analysis and Adaptive Algorithms with Biased Estimators

Adaptive stochastic gradient algorithms (AdaGrad, RMSProp, AMSGrad) retain their non-asymptotic convergence rates in the presence of time-dependent bias, given proper control via batch size or estimator parameterization.

  • General ASA Template:

\theta_{n+1} = \theta_n - \gamma_{n+1} A_n H_{\theta_n}(X_{n+1})

where $H_{\theta_n}(X_{n+1})$ is a biased gradient estimator (Surendran et al., 2024).

  • Rate Results: In convex and nonconvex settings, with batch size $N_n \sim n^\alpha$, bias $r_n \sim n^{-\alpha}$, and stepsizes $\gamma_n \sim n^{-1/2}$,

\mathbb{E}\left[\|\nabla V(\theta_R)\|^2\right] = O\left(\frac{\log n}{\sqrt{n}} + b_n\right)

where $b_n$ is governed by the bias decay; for $\alpha \geq 1/4$, the optimal $O(\log n / \sqrt{n})$ rate is achieved (Surendran et al., 2024).

  • Practical Guidance: For importance-weighted autoencoding and Monte Carlo-based estimators, bias control via $N_n \sim n^{1/4}$ achieves near-unbiased rates, with empirical confirmation in deep generative modeling (Surendran et al., 2024).
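The growing-batch recipe can be sketched on a toy quadratic $V(\theta) = \theta^2/2$ (so $\nabla V(\theta) = \theta$). The gradient oracle below carries an artificial bias $r_n \sim n^{-1/4}$, mimicking a Monte Carlo estimator whose bias shrinks with the batch size $N_n \sim n^{1/4}$; all constants are illustrative assumptions, not prescriptions from the cited work.

```python
import numpy as np

rng = np.random.default_rng(3)

theta = 5.0
for n in range(1, 3001):
    gamma = 0.5 / np.sqrt(n)            # stepsize gamma_n ~ n^{-1/2}
    n_batch = max(1, int(n ** 0.25))    # growing batch N_n ~ n^{1/4}
    bias = 0.5 / n_batch                # estimator bias r_n ~ n^{-1/4}
    noise = rng.normal(size=n_batch).mean()
    grad_est = theta + bias + noise     # biased estimate of grad V(theta)
    theta -= gamma * grad_est

# |grad V(theta)| = |theta| settles at the bias-controlled floor O(r_n).
print(f"final gradient norm: {abs(theta):.3f}")
```

The iterate descends quickly from its starting point and then hovers at a floor set by the residual estimator bias, consistent with the $O(\log n/\sqrt{n} + b_n)$ rate above.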

5. Differential Inclusion, Input-to-State Stability, and Multi-Agent/Distributed Settings

The presence of persistent bias changes the limiting dynamics from ODEs to perturbed differential inclusions,

\dot{x}(t) \in F(x(t)) + \bar{B}(0, \epsilon)

where $\bar{B}(0, \epsilon)$ denotes the bias perturbation set.

  • Input-to-State Stability (ISS): If the perturbed DI is ISS, and discrete iterates remain bounded, then convergence to an $O(\epsilon)$ neighborhood of equilibrium is guaranteed. ISS Lyapunov conditions, globally Lipschitz drift, and quadratic potential bounds yield a.s. boundedness and stability (Paul et al., 16 Jan 2026, Ramaswamy et al., 2018).
  • Asynchronous Approximate Dynamics: In distributed and asynchronous multi-agent SA (e.g., A2VI, A2PG), stability is retained under bounded bias. Convergence is to a neighborhood of the solution set, with diameter proportional to the bias magnitude (Ramaswamy et al., 2018).
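A scalar toy check (linear contractive drift, worst-case constant bias; all constants are assumptions) makes the ISS-style guarantee concrete: under a persistent bias of magnitude $\epsilon$, the iterates settle into an $O(\epsilon)$ neighborhood of the equilibrium $x^* = 0$, and shrinking $\epsilon$ shrinks that neighborhood proportionally.

```python
import numpy as np

rng = np.random.default_rng(4)

def neighborhood_radius(eps, n_steps=40_000, burn_in=5_000, a=0.05):
    """Distance of the tail-averaged iterate from the unbiased equilibrium."""
    noise = rng.normal(scale=0.1, size=n_steps).tolist()
    x, acc = 1.0, 0.0
    for n in range(n_steps):
        x -= a * (x + eps + noise[n])   # drift F(x) = x, persistent bias eps
        if n >= burn_in:
            acc += x
    return abs(acc / (n_steps - burn_in))

r_big = neighborhood_radius(0.1)
r_small = neighborhood_radius(0.01)
print(f"radius at eps=0.1: {r_big:.3f}, at eps=0.01: {r_small:.3f}")
```

The limiting radius tracks the bias magnitude roughly linearly, matching the picture of convergence to a neighborhood whose diameter is proportional to the bias.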

6. Riemannian Manifold Stochastic Approximation with Bias

Extension to Riemannian SA schemes allows general geometric optimization under bias:

  • Fixed-Step Markov Chain Structure: With constant stepsize $\gamma$, the chain $\{\theta_n\}$ admits a unique stationary law that concentrates near the optimum as $\gamma \to 0$, with bias $O(\gamma)$ (Durmus et al., 2021, Durmus et al., 2020).
  • Lyapunov and CLT Theory: The invariant measure bias has leading term

-\int_\Theta \langle \mathrm{grad}\,g(\theta), h(\theta)\rangle\,\mu^\gamma(d\theta) = \frac{\gamma}{2}\,[\mathrm{Hess}\,g : \Sigma](\theta^*) + o(\gamma)

The variance scales as $O(\gamma)$; a central limit theorem quantifies fluctuations around the biased mean (Durmus et al., 2021).
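A minimal constant-step sketch on the unit sphere illustrates the fixed-step behavior (projection retraction and a linear objective are illustrative choices, not the exact scheme of the cited papers): minimize $g(\theta) = -\langle\theta, v\rangle$ over $\|\theta\| = 1$ with noisy Riemannian gradients, and observe the stationary law concentrating near the optimum as $\gamma$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(5)
v = np.array([1.0, 0.0, 0.0])        # optimum theta* = v

def tail_avg_error(gamma, n_steps=20_000, burn_in=5_000):
    """Average distance of the iterate from the optimum in near-stationarity."""
    theta = np.array([0.0, 1.0, 0.0])
    acc = 0.0
    for n in range(n_steps):
        egrad = -v + rng.normal(scale=0.3, size=3)  # noisy Euclidean gradient
        rgrad = egrad - theta * (theta @ egrad)     # project onto tangent space
        theta = theta - gamma * rgrad               # Riemannian SA step
        theta /= np.linalg.norm(theta)              # retraction back to sphere
        if n >= burn_in:
            acc += np.linalg.norm(theta - v)
    return acc / (n_steps - burn_in)

e_big = tail_avg_error(0.1)
e_small = tail_avg_error(0.01)
print(f"stationary spread: gamma=0.1 -> {e_big:.3f}, gamma=0.01 -> {e_small:.3f}")
```

The stationary spread around the optimum tightens as $\gamma \to 0$, in line with the $O(\gamma)$ bias and variance scalings above.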

7. Practical Applications, Empirical Validation, and Bias-Mitigation Strategies


Taken together, contemporary research provides precise bias–variance analysis, non-asymptotic convergence theory, and robust bias mitigation approaches for a broad class of biased stochastic approximation schemes. These frameworks inform practical algorithm design in deep learning, RL, simulation-based inference, and geometric optimization, offering mathematically grounded recipes for balancing stepsize, bias control, variance reduction, and scaling to high-dimensional or nonconvex objectives.
