Stochastic ADMM: Algorithms & Convergence

Updated 1 May 2026

Stochastic ADMM is a family of first-order optimization algorithms that incorporates stochastic approximations to solve large-scale, linearly constrained problems.
It extends classical ADMM by integrating variance reduction, proximal updates, and adaptive schemes to enhance convergence and scalability in applications like machine learning and signal processing.
The approach offers rigorous convergence guarantees for both convex and nonconvex objectives, with advanced variants supporting continuous-time analyses and robust distributed implementations.

Stochastic Alternating Direction Method of Multipliers

Stochastic Alternating Direction Method of Multipliers (Stochastic ADMM) refers to a family of first-order optimization algorithms for solving linearly constrained problems in which some or all components of the objective function are given in stochastic (data-driven or sample-based) form. These methods extend classical ADMM to accommodate large-scale, nonsmooth, and nonconvex objectives by leveraging stochastic approximations, variance reduction, and, in recent years, continuous-time analysis. Stochastic ADMM is foundational in distributed, online, and robust machine learning as well as in high-dimensional signal processing and control.

1. Problem Formulation and ADMM Fundamentals

Stochastic ADMM targets convex or nonconvex optimization problems with separable objective and linear constraints: $\min_{x\in\mathcal{X},\,y\in\mathcal{Y}} \ \mathbb{E}_{\xi}[\theta_1(x, \xi)] + \theta_2(y) \quad \text{s.t.} \quad A x + B y = b$ where:

$x\in\mathbb{R}^{d_1}$ , $y\in\mathbb{R}^{d_2}$ ; $A\in\mathbb{R}^{m\times d_1}$ , $B\in\mathbb{R}^{m\times d_2}$ , $b\in\mathbb{R}^m$ .
$\theta_1(x, \xi)$ is a (possibly nonsmooth/nonconvex) instance-specific loss, $\theta_2(y)$ is a separable regularizer.
$\mathcal{X},\mathcal{Y}$ are closed, convex constraint sets; $\{\xi_k\}$ is a sequence of i.i.d. data samples.
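The augmented Lagrangian, with multiplier $\lambda\in\mathbb{R}^m$ and penalty parameter $\beta>0$, is $L_\beta(x,y,\lambda) = \theta_1(x) + \theta_2(y) - \langle \lambda,\, Ax + By - b\rangle + \frac{\beta}{2}\|Ax + By - b\|^2$, where $\theta_1(x) := \mathbb{E}_{\xi}[\theta_1(x,\xi)]$.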

Classical (deterministic) ADMM alternates between minimization over $x$ and $y$ and a dual ascent step. However, in stochastic and data-intensive regimes, replacing deterministic subproblems with stochastic approximations or variance-reduced updates is critical for computational scalability (Ouyang et al., 2012).
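To make the notation concrete, the following minimal sketch evaluates an augmented Lagrangian of the form $\theta_1(x) + \theta_2(y) - \langle\lambda, Ax+By-b\rangle + \frac{\beta}{2}\|Ax+By-b\|^2$. The minus-sign multiplier convention, the function name `augmented_lagrangian`, and the toy losses are assumptions chosen for illustration.

```python
import numpy as np

def augmented_lagrangian(theta1, theta2, x, y, lam, A, B, b, beta):
    """Evaluate L_beta(x, y, lam) with the minus-sign multiplier convention (assumed)."""
    r = A @ x + B @ y - b                      # constraint residual Ax + By - b
    return theta1(x) + theta2(y) - lam @ r + 0.5 * beta * (r @ r)

# Toy splitting x - y = 0, i.e. A = I, B = -I, b = 0 (illustrative only).
d = 2
A, B, b = np.eye(d), -np.eye(d), np.zeros(d)
theta1 = lambda x: 0.5 * float(x @ x)          # stand-in for the expected loss
theta2 = lambda y: float(np.abs(y).sum())      # l1 regularizer
x = np.array([1.0, -2.0])
lam = np.array([1.0, 1.0])

# At a feasible pair (y = x) the multiplier and penalty terms vanish.
value_feasible = augmented_lagrangian(theta1, theta2, x, x, lam, A, B, b, beta=2.0)
# At an infeasible pair both extra terms contribute.
value_infeasible = augmented_lagrangian(theta1, theta2, x, np.zeros(d), lam, A, B, b, beta=2.0)
```

At feasible points the augmented Lagrangian coincides with the original objective, which is why the penalty parameter affects the path of the iterates but not the solution set.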

2. Core Algorithmic Schemes

Stochastic ADMM schemes retain the three-step outer iteration of ADMM but use random samples and often introduce proximal or linearized subproblems. Canonical stochastic ADMM (Ouyang et al., 2012) for convex objectives is:

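$x$-update:

$x_{k+1} = \arg\min_{x\in\mathcal{X}} \Big\{ \langle \theta_1'(x_k,\xi_{k+1}),\, x\rangle + \frac{\beta}{2}\big\|Ax + By_k - b - \lambda_k/\beta\big\|^2 + \frac{\|x - x_k\|^2}{2\eta_{k+1}} \Big\}$

$y$-update:

$y_{k+1} = \arg\min_{y\in\mathcal{Y}} \Big\{ \theta_2(y) + \frac{\beta}{2}\big\|Ax_{k+1} + By - b - \lambda_k/\beta\big\|^2 \Big\}$

Dual-update:

$\lambda_{k+1} = \lambda_k - \beta\,(Ax_{k+1} + By_{k+1} - b)$

Here $\theta_1'(x_k,\xi_{k+1})$ is a stochastic (sub)gradient of $\theta_1$ at $x_k$, $\beta>0$ is the penalty parameter, and $\lambda_k$ is the multiplier. The step size $\eta_{k+1}$ is adapted based on problem regularity and the convergence regime (Ouyang et al., 2012).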
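As a minimal sketch of the linearized proximal $x$-step, assume the unconstrained case $\mathcal{X}=\mathbb{R}^{d_1}$ and a subproblem of the form $\min_x \langle g, x\rangle + \frac{\beta}{2}\|Ax + By_k - b - \lambda_k/\beta\|^2 + \frac{1}{2\eta}\|x - x_k\|^2$, where $g$ is a stochastic gradient. Setting the gradient to zero gives a single linear solve. The helper name `x_update` and the random test data are hypothetical.

```python
import numpy as np

def x_update(g, x_k, y_k, lam_k, A, B, b, beta, eta):
    """Solve (beta A^T A + I/eta) x = x_k/eta - g - beta A^T (B y_k - b) + A^T lam_k.

    Assumes X = R^d (no projection), which is a simplification.
    """
    d = x_k.size
    H = beta * A.T @ A + np.eye(d) / eta
    rhs = x_k / eta - g - beta * A.T @ (B @ y_k - b) + A.T @ lam_k
    return np.linalg.solve(H, rhs)

# Random data, purely for checking the algebra.
rng = np.random.default_rng(0)
m, d1, d2 = 3, 4, 2
A = rng.standard_normal((m, d1))
B = rng.standard_normal((m, d2))
b = rng.standard_normal(m)
x_k = rng.standard_normal(d1)
y_k = rng.standard_normal(d2)
lam_k = rng.standard_normal(m)
g = rng.standard_normal(d1)                    # stands in for a stochastic gradient
beta, eta = 1.5, 0.1

x_next = x_update(g, x_k, y_k, lam_k, A, B, b, beta, eta)

# First-order optimality of the subproblem: this vector should be ~0.
stationarity = (g + beta * A.T @ (A @ x_next + B @ y_k - b - lam_k / beta)
                + (x_next - x_k) / eta)
```

Because the loss enters only through $g$, the per-iteration cost is independent of the number of samples; the proximal term $\|x-x_k\|^2/(2\eta)$ keeps the linear system well conditioned even when $A^\top A$ is singular.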

Variants include:

Linearized/Proximal ADMM: adding Bregman-divergence or explicit quadratic regularizers (Zhao et al., 2013).
Mini-batch and variance-reduced ADMM: integrating control variates or snapshot-based estimators for improved convergence, including SVRG-ADMM, SAGA-ADMM, and SCAS-ADMM (Zhao et al., 2015, Zheng et al., 2016, Huang et al., 2016).
Adaptive stochastic ADMM: using time-varying or coordinatewise proximal matrices for per-coordinate adaptivity (Zhao et al., 2013).
Accelerated stochastic ADMM: Nesterov-type extrapolation and momentum incorporation for $O(1/T)$ non-ergodic rates (Fang et al., 2017).
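The snapshot-based estimators mentioned in the list above can be illustrated with the generic SVRG control variate, $\nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$. This is a sketch of the standard estimator on a least-squares finite sum; it is not claimed to match the exact update of any cited ADMM variant, and all names are illustrative.

```python
import numpy as np

def svrg_gradient(grad_i, x, x_snap, full_grad_snap, i):
    """SVRG estimator: sample gradient corrected by a snapshot control variate."""
    return grad_i(x, i) - grad_i(x_snap, i) + full_grad_snap

# Least-squares finite sum f(x) = (1/n) sum_i 0.5 * (D[i] @ x - c[i])**2.
rng = np.random.default_rng(0)
n, d = 50, 4
D = rng.standard_normal((n, d))
c = rng.standard_normal(n)
grad_i = lambda x, i: D[i] * (D[i] @ x - c[i])
full_grad = lambda x: D.T @ (D @ x - c) / n

x_snap = rng.standard_normal(d)                 # snapshot point
x = x_snap + 0.01 * rng.standard_normal(d)      # current iterate near the snapshot
mu_snap = full_grad(x_snap)                     # full gradient, computed once per epoch

estimates = np.array([svrg_gradient(grad_i, x, x_snap, mu_snap, i) for i in range(n)])
plain = np.array([grad_i(x, i) for i in range(n)])

# Total variance across components for each estimator.
var_svrg = estimates.var(axis=0).sum()
var_plain = plain.var(axis=0).sum()
```

Averaged over all indices, the estimator equals the full gradient at `x` (it is unbiased), while its variance shrinks as the iterate approaches the snapshot. This is what allows constant step sizes in variance-reduced schemes.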

Continuous-time formulations recast the stochastic ADMM iterates as weak approximations to stochastic differential equations (SDEs), shedding light on the role of over-relaxation, noise, and bias-variance trade-offs (Zhou et al., 2020, Li, 2024).

3. Theoretical Guarantees and Convergence Rates

Rigorous analysis requires assumptions on bounded second-moment or variance of the stochastic gradients, convexity or strong convexity of objective terms, and (optionally) smoothness or the Kurdyka-Łojasiewicz (KL) property for nonconvexity (Ouyang et al., 2012, Bian et al., 2020).

Key results:

For general convex objectives:

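$\mathbb{E}\big[\theta(\bar{x}_T,\bar{y}_T) - \theta(x^*,y^*) + \rho\,\|A\bar{x}_T + B\bar{y}_T - b\|\big] = O(1/\sqrt{T})$

where $\theta(x,y) := \mathbb{E}_{\xi}[\theta_1(x,\xi)] + \theta_2(y)$, $(x^*,y^*)$ is an optimal solution, $\rho>0$ is a fixed constant, and the averages $(\bar{x}_T,\bar{y}_T)$ are over the first $T$ iterates (Ouyang et al., 2012).

For $\mu$-strongly convex objectives:

$\mathbb{E}\big[\theta(\bar{x}_T,\bar{y}_T) - \theta(x^*,y^*) + \rho\,\|A\bar{x}_T + B\bar{y}_T - b\|\big] = O(\log T / T)$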

(Ouyang et al., 2012).

Variance-reduced and accelerated schemes (e.g., SA-ADMM, SCAS-ADMM, SVRG-ADMM):

$O(1/T)$

in both objective gap and feasibility violation, matching batch ADMM under similar regularity (see Table below) (Zhong et al., 2013, Zhao et al., 2015, Zheng et al., 2016, Fang et al., 2017).

Method	Rate	Memory cost
Batch ADMM	O(1/T)	O(lp + lq)
SA-ADMM	O(1/T)	O(np + lp + lq)
SCAS-ADMM	O(1/T)	O(lp + lq)
SVRG-ADMM	O(1/T)	O(d d̃)
ACC-SADMM	O(1/T) non-ergodic	O(d)
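The ergodic guarantees above are stated for averaged iterates and depend on how the step size decays. The sketch below shows two commonly used schedules, of order $1/\sqrt{k}$ for general convex problems and $1/(\mu k)$ for $\mu$-strongly convex ones, together with an incremental running average. The constant `c` and the function names are hypothetical tuning choices, not values taken from the cited papers.

```python
import math

def step_size(k, regime="convex", c=1.0, mu=None):
    """Decaying step size for iteration k >= 1 (constants are assumptions)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if regime == "convex":
        return c / math.sqrt(k)                 # order 1/sqrt(k)
    if regime == "strongly_convex":
        if mu is None or mu <= 0:
            raise ValueError("mu must be positive for the strongly convex regime")
        return 1.0 / (mu * k)                   # order 1/(mu k)
    raise ValueError(f"unknown regime: {regime}")

def running_average(avg_prev, new_value, k):
    """Mean of k values, given the mean of the first k - 1 (k starts at 1)."""
    return avg_prev + (new_value - avg_prev) / k

# Averaging a short scalar sequence without storing it.
values = [4.0, 2.0, 6.0]
avg = 0.0
for k, v in enumerate(values, start=1):
    avg = running_average(avg, v, k)

eta_convex = step_size(4)                                   # 1 / sqrt(4)
eta_strong = step_size(5, regime="strongly_convex", mu=2.0) # 1 / (2 * 5)
```

The incremental form of the average needs only one extra vector of storage, so reporting averaged iterates adds no meaningful memory cost to the entries in the table.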

In nonconvex problems with variance reduction, $O(1/T)$ rates in expectation for stationary solutions are established under L-smoothness and bounded gradient assumptions (Huang et al., 2016, Huang et al., 2020).

Recent Hilbert-space extensions incorporate infinite-dimensional constraints (e.g., PDE-constrained optimal control), achieving nonergodic convergence rates in both the strongly convex and the general convex cases by integrating Nesterov extrapolation and adaptive penalty schedules (Deng et al., 10 Mar 2026).

4. Advanced Variants and Extensions

Variance-Reduced and Accelerated Stochastic ADMM

Variance reduction, by maintaining history (SAG-ADMM, SAGA-ADMM) or snapshot-based control (SVRG-ADMM, SCAS-ADMM), permits $O(1/T)$ convergence in expectation and, with suitable acceleration (e.g., momentum, Nesterov extrapolation), can reach non-ergodic $O(1/T)$ rates optimal for separable linearly constrained problems (Zheng et al., 2016, Fang et al., 2017). Accelerated stochastic ADMM achieves further improvements with optimal dependence on the smoothness constant for empirical risk minimization (Zhang et al., 2016).

Nonconvex and Nonsmooth Stochastic ADMM

For nonconvex objectives, recent research deploys variance-reduced estimators (SVRG, SAGA, SARAH, SPIDER) in the ADMM inner loop, ensuring global convergence under finite-sum or expectation-based objectives and sometimes requiring the KL property for global analysis (Bian et al., 2020, Huang et al., 2020). Under mild regularity, algorithms achieve $O(1/\epsilon)$ iteration complexity for $\epsilon$-stationarity (Huang et al., 2016).

Adaptive and Robust Versions

Adaptive stochastic ADMM generalizes the proximal term to per-coordinate Bregman divergences, closely related to AdaGrad, and can provably minimize the dual-norm regret term over time, especially beneficial in high-dimensional or ill-conditioned regimes (Zhao et al., 2013).
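A minimal sketch of the per-coordinate idea, assuming the splitting $x - y = 0$ and an AdaGrad-style diagonal proximal matrix $H_k = \mathrm{diag}(\delta + \sqrt{\sum_{j\le k} g_j^2})$: the $x$-step then has a closed form, coordinate by coordinate. The exact proximal matrix used in the cited work may differ; the names `adaptive_x_update`, `grad_sq_sum`, and `delta` are illustrative.

```python
import numpy as np

def adaptive_x_update(g, x_k, y_k, lam_k, grad_sq_sum, beta, eta, delta=1e-8):
    """x-step with a diagonal AdaGrad-style proximal term (assumed form).

    Minimizes <g, x> + beta/2 ||x - y_k - lam_k/beta||^2
              + 1/(2 eta) (x - x_k)^T diag(h) (x - x_k).
    """
    grad_sq_sum = grad_sq_sum + g * g           # accumulate squared gradients
    h = delta + np.sqrt(grad_sq_sum)            # diagonal of the proximal matrix
    x_next = (beta * y_k + lam_k + h * x_k / eta - g) / (beta + h / eta)
    return x_next, grad_sq_sum

x_k = np.array([0.5, -0.5, 1.0])
y_k = np.array([0.4, -0.6, 0.9])
lam_k = np.array([0.1, 0.0, -0.1])
g = np.array([0.5, -2.0, 0.0])                  # stand-in stochastic gradient
beta, eta, delta = 1.0, 0.1, 1e-8

x_next, acc = adaptive_x_update(g, x_k, y_k, lam_k, np.zeros(3), beta, eta, delta)

# First-order optimality of the subproblem: this vector should be ~0.
h = delta + np.sqrt(acc)
stationarity = g + beta * (x_next - y_k - lam_k / beta) + h * (x_next - x_k) / eta
```

Coordinates with a large gradient history receive a heavier proximal weight and therefore move less, which is the mechanism behind the link to AdaGrad.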

Distributed and Byzantine-robust stochastic ADMM extends the formulation to multi-agent scenarios, adding consensus-form constraints and robustness penalties to manage untrusted or faulty nodes (Lin et al., 2021).

Continuous-Time and SME Theory

By interpreting stochastic ADMM iterates as discrete samples of an SDE (the “stochastic modified equation”, SME) (Zhou et al., 2020, Li, 2024), new insight emerges into the bias-variance trade-off, the role of over-relaxation, and optimal stopping. For instance, under proper scaling, the iterate trajectory of G-sADMM converges weakly to the solution of an SDE whose coefficient matrix incorporates the algorithmic parameters and underpins the bias-variance dynamics (Li, 2024).
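To give a feel for what a continuous-time surrogate looks like in practice, the sketch below simulates a generic small-noise SDE $M\,dX_t = -\nabla V(X_t)\,dt + \sqrt{\epsilon}\,\sigma\,dW_t$ with the Euler-Maruyama scheme. This equation, the potential $V$, the constant noise scale $\sigma$, and the matrix $M$ are assumptions chosen for illustration; they are not the limiting equation derived in the cited works.

```python
import numpy as np

def euler_maruyama(grad_V, M, x0, eps, sigma, dt, n_steps, rng):
    """Simulate M dX = -grad V(X) dt + sqrt(eps) * sigma dW (assumed generic form)."""
    x = np.array(x0, dtype=float)
    M_inv = np.linalg.inv(M)
    path = [x.copy()]
    for _ in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal(x.size)   # Brownian increment
        x = x + M_inv @ (-grad_V(x) * dt + np.sqrt(eps) * sigma * dW)
        path.append(x.copy())
    return np.array(path)

grad_V = lambda x: x                     # V(x) = 0.5 ||x||^2, minimizer at 0
M = 2.0 * np.eye(2)                      # hypothetical parameter-dependent matrix
rng = np.random.default_rng(0)

# eps = 0 switches the noise off and leaves the deterministic gradient flow.
path_det = euler_maruyama(grad_V, M, [1.0, -1.0], eps=0.0, sigma=1.0,
                          dt=0.01, n_steps=2000, rng=rng)
# A small eps gives a noisy path that fluctuates around the minimizer.
path_noisy = euler_maruyama(grad_V, M, [1.0, -1.0], eps=0.01, sigma=1.0,
                            dt=0.01, n_steps=2000, rng=rng)
```

In this toy model the matrix $M$ rescales the drift (bias decay) and the noise together, which is one simple way to see how algorithmic parameters can shape a bias-variance trade-off.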

5. Implementation Practices and Empirical Behavior

Pseudocode for basic stochastic ADMM is:

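Initialize $(x_0, y_0, \lambda_0)$. For $k = 0, 1, \dots, T-1$: draw a sample $\xi_{k+1}$ and evaluate a stochastic (sub)gradient of $\theta_1$ at $x_k$; then perform the $x$-update, the $y$-update, and the dual update of Section 2, in that order. Return the averaged iterates (Ouyang et al., 2012).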
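A runnable sketch of this loop on a lasso-type problem, $\min_x \frac{1}{2n}\sum_i (d_i^\top x - c_i)^2 + \nu\|y\|_1$ subject to $x - y = 0$, is given below. With this splitting both subproblems have closed forms. The problem instance, the schedule $\eta_k = \eta_0/\sqrt{k}$, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stochastic_admm_lasso(D, c, nu, beta=1.0, eta0=0.1, T=2000, seed=0):
    """Basic stochastic ADMM for the splitting x - y = 0 (A = I, B = -I, b = 0)."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    x, y, lam = np.zeros(d), np.zeros(d), np.zeros(d)
    x_avg, y_avg = np.zeros(d), np.zeros(d)
    for k in range(1, T + 1):
        i = rng.integers(n)                       # draw one sample
        g = D[i] * (D[i] @ x - c[i])              # stochastic gradient of the loss
        eta = eta0 / np.sqrt(k)                   # decaying step size (assumed schedule)
        x = (beta * y + lam + x / eta - g) / (beta + 1.0 / eta)   # x-update
        y = soft_threshold(x - lam / beta, nu / beta)             # y-update
        lam = lam - beta * (x - y)                                # dual update
        x_avg += (x - x_avg) / k                  # running averages of the iterates
        y_avg += (y - y_avg) / k
    return x_avg, y_avg, lam

def objective(D, c, nu, x):
    return 0.5 * np.mean((D @ x - c) ** 2) + nu * np.abs(x).sum()

# Synthetic sparse regression instance (illustrative only).
rng = np.random.default_rng(1)
n, d, nu = 200, 5, 0.1
D = rng.standard_normal((n, d))
x_true = np.array([2.0, -2.0, 0.0, 0.0, 0.0])
c = D @ x_true + 0.01 * rng.standard_normal(n)

x_avg, y_avg, lam = stochastic_admm_lasso(D, c, nu)
obj_start = objective(D, c, nu, np.zeros(d))
obj_end = objective(D, c, nu, x_avg)
gap = np.linalg.norm(x_avg - y_avg)               # feasibility of averaged iterates
```

Each iteration touches a single data row, so the cost per step is $O(d)$ regardless of $n$. In this splitting the soft-thresholding step also keeps every multiplier entry bounded by $\nu$ in magnitude, which makes the averaged constraint violation shrink like $1/T$.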

Variance-reduced, mini-batch, block-wise, and accelerated versions have higher per-iteration cost but exhibit better empirical scaling and convergence rates, especially on large-scale objectives (see, e.g., the comparisons in Zhao et al., 2015, Zheng et al., 2016, Zhao et al., 2013, Fang et al., 2017). Empirical benchmarks consistently report:

Stochastic and variance-reduced ADMM methods outperform batch/deterministic ADMM in early and mid-stage optimization.
Storage cost is a critical consideration: SVRG/SCAS-type approaches, whose memory does not grow with the number of samples $n$ (e.g., $O(lp + lq)$ for SCAS-ADMM), scale to large $n$, while SAG-style methods require memory that grows with $n$ (e.g., $O(np + lp + lq)$ for SA-ADMM).
Implementation details such as penalty schedule, step-size tuning, and constraint over-relaxation may affect both speed and feasibility violation (Ouyang et al., 2012, Zhao et al., 2015, Li, 2024).
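Constraint over-relaxation, mentioned in the last item, is usually implemented by replacing $Ax_{k+1}$ in the $y$- and dual updates with $\alpha A x_{k+1} - (1-\alpha)(By_k - b)$, where $\alpha \in (1, 2)$ corresponds to over-relaxation. The sketch below shows this standard device from the deterministic ADMM literature; how the cited stochastic variants apply it is not specified here, and the helper name is hypothetical.

```python
import numpy as np

def relaxed_Ax(A, x_next, B, y_k, b, alpha):
    """Relaxed surrogate for A x_{k+1}; alpha = 1 recovers the unrelaxed update."""
    return alpha * (A @ x_next) - (1.0 - alpha) * (B @ y_k - b)

A = np.array([[1.0, 0.0], [0.0, 2.0]])
B = -np.eye(2)
b = np.zeros(2)
x_next = np.array([1.0, 3.0])

# A feasible y satisfies A x + B y = b, i.e. y = A x for this B and b.
y_feasible = A @ x_next
y_other = np.array([0.0, 0.0])

plain = relaxed_Ax(A, x_next, B, y_other, b, alpha=1.0)       # equals A @ x_next
fixed = relaxed_Ax(A, x_next, B, y_feasible, b, alpha=1.6)    # unchanged at feasibility
pushed = relaxed_Ax(A, x_next, B, y_other, b, alpha=1.6)      # extrapolated otherwise
```

Relaxation leaves feasible pairs untouched and only extrapolates when the constraint is violated, so it changes the transient behavior without changing the fixed points.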

6. Applications and Extensions

Stochastic ADMM and its variants are foundational in:

Distributed and federated learning (including Byzantine-robust regimes) (Lin et al., 2021).
Structured and graph-constrained regression (e.g., generalized lasso, graph-guided fused lasso).
Large-scale empirical risk minimization with $\ell_1$/$\ell_2$-regularization or group structure (Zheng et al., 2016, Zhong et al., 2013).
Nonconvex learning, robust estimation, and black-box or zeroth-order optimization in adversarial settings (Bian et al., 2020, Huang et al., 2019).
PDE-constrained stochastic control and infinite-dimensional optimization (Deng et al., 10 Mar 2026).
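As an example of how a structured regularizer is cast into the $Ax + By = b$ template, consider graph-guided fused lasso, which penalizes $\nu\|Fx\|_1$ for an edge-difference matrix $F$. Introducing $y = Fx$ gives $A = F$, $B = -I$, $b = 0$. The sketch below builds $F$ from an edge list; the function name and the small path graph are illustrative assumptions.

```python
import numpy as np

def edge_difference_matrix(edges, d, weights=None):
    """Row e of F computes w_e * (x[i] - x[j]) for edge e = (i, j)."""
    F = np.zeros((len(edges), d))
    for row, (i, j) in enumerate(edges):
        w = 1.0 if weights is None else weights[row]
        F[row, i] = w
        F[row, j] = -w
    return F

# Path graph 0 - 1 - 2 - 3 over four coefficients.
edges = [(0, 1), (1, 2), (2, 3)]
F = edge_difference_matrix(edges, d=4)

# ADMM template: A x + B y = b with y = F x.
A, B, b = F, -np.eye(len(edges)), np.zeros(len(edges))

x = np.array([1.0, 1.0, 3.0, 3.0])       # piecewise-constant coefficients
y = F @ x                                 # auxiliary variable carrying the differences
residual = A @ x + B @ y - b
penalty = np.abs(y).sum()                 # fused-lasso penalty ||F x||_1
```

With this splitting the nonsmooth term acts on $y$ alone, so the $y$-update reduces to soft-thresholding while the sampled loss enters only through the $x$-update.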

Ongoing advances involve multi-block extensions, adaptive or dynamic constraint penalty schemes, continuous-time formulations, and robust or decentralized communication design.

7. Summary Table: Stochastic ADMM Landscape

Algorithm	Objective Type	Rate (convex)	Memory	Key Features	Reference
Stochastic ADMM	NSE, convex	$O(1/\sqrt{T})$	-	Proximal stochastic x-update	(Ouyang et al., 2012)
SA-ADMM	NSE, convex	$O(1/T)$	$O(np + lp + lq)$	Surrogate gradient, full memory	(Zhong et al., 2013)
SCAS-ADMM	Smooth, convex	$O(1/T)$	$O(lp + lq)$	Variance reduction, sparse memory	(Zhao et al., 2015)
SVRG-ADMM	Smooth, convex/NCX	$O(1/T)$ / $O(1/T)$	$O(d\tilde{d})$	Epoch-based variance reduction	(Zheng et al., 2016)
SADMM (KL)	Nonsmooth, nonconvex	- (stationarity)	-	VRADMM, global convergence under KL	(Bian et al., 2020)
ADA-SADMM	Convex	-	-	Adaptive Bregman proximal, AdaGrad link	(Zhao et al., 2013)
ACC-SADMM	Convex	$O(1/T)$ non-erg.	$O(d)$	Accelerated, Nesterov, dual compensation	(Fang et al., 2017)
SM-ADMM	Any (via SDE)	SDE analysis	-	Weak convergence, bias-variance trade-off	(Zhou et al., 2020)
Hilbert-SADMM	Hilbert/Infinite	-	-	Nesterov acceleration, nonergodic rates	(Deng et al., 10 Mar 2026)

Abbreviations: NSE = Nonsmooth (Separable); NCX = Nonconvex; VR = Variance Reduction; KL = Kurdyka-Łojasiewicz property.

Stochastic ADMM combines stochastic optimization, convex analysis, and modern algorithmic design. Variants exploiting adaptive preconditioning, variance reduction, Nesterov acceleration, and continuous-time theory continue to extend both its theoretical boundaries and its practical reach (Ouyang et al., 2012, Zhao et al., 2015, Zheng et al., 2016, Fang et al., 2017, Zhao et al., 2013, Deng et al., 10 Mar 2026).