
Stochastic ADMM Variants

Updated 22 February 2026
  • Stochastic ADMM Variants are optimization methods that extend classical ADMM by using stochastic updates and adaptive Bregman divergences for data-dependent regularization.
  • They enable large-scale machine learning and empirical risk minimization by processing mini-batches to achieve faster convergence and tighter regret bounds.
  • Practical implementations balance efficiency and accuracy through diagonal or full-matrix adaptive prox strategies, ensuring robust performance on high-dimensional and streaming data.

Stochastic ADMM variants are a class of optimization algorithms that generalize the classical Alternating Direction Method of Multipliers (ADMM) to stochastic regimes by using single-sample or mini-batch updates and adaptive Bregman divergences as proximal regularization. These variants are particularly relevant for large-scale machine learning and empirical risk minimization, where evaluating the entire empirical objective at every iteration is computationally prohibitive. By randomizing the update direction and incorporating data-adaptive curvature (via Bregman divergences tuned by the observed gradients), stochastic ADMM variants achieve improved regret bounds, faster empirical convergence, and scalability to high-dimensional and streaming data (Zhao et al., 2013).

1. Problem Setting and Motivation

Stochastic ADMM variants address problems of the form:

$$\min_{x,\,z} f(x) + g(z) \quad\text{subject to}\quad A x + B z = c$$

where $f$ is typically an expectation or empirical mean over a large data set (e.g., $f(x) = \mathbb{E}_{\xi}\, \ell(x;\xi)$), and $g$ is convex (possibly with structure-inducing constraints). The classical ADMM framework, while well established for convex problems with full-batch access, becomes non-viable when access to the expected loss or its full gradient is computationally infeasible.

Stochastic ADMM approaches substitute the full expected loss by a randomly sampled loss term or its stochastic gradient at each iteration, mimicking stochastic gradient descent, and introduce an adaptive Bregman divergence as the regularizing proximal term, replacing the static quadratic penalty of conventional ADMM. This mechanism enables adaptive, data-dependent regularization that aligns with the geometry encountered over the stochastic sequence of gradients (Zhao et al., 2013).
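As a minimal sketch of the sampling step, the snippet below replaces a full-data gradient with a mini-batch gradient. The logistic loss, the synthetic data, and the helper names `logistic_grad` and `minibatch_grad` are illustrative assumptions, not part of the cited method.

```python
import numpy as np

def logistic_grad(x, features, labels):
    """Gradient of the mean logistic loss log(1 + exp(-y * <a, x>)) over the given rows."""
    margins = labels * (features @ x)
    # d/dx log(1 + exp(-m)) with m = y * <a, x> is -y * a / (1 + exp(m)).
    weights = -labels / (1.0 + np.exp(margins))
    return features.T @ weights / len(labels)

def minibatch_grad(x, features, labels, batch_size, rng):
    """Stochastic gradient from a uniformly sampled mini-batch (without replacement)."""
    idx = rng.choice(len(labels), size=batch_size, replace=False)
    return logistic_grad(x, features[idx], labels[idx])

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))             # hypothetical data set
labels = np.where(rng.random(200) < 0.5, -1.0, 1.0)
x = np.zeros(5)

g_full = logistic_grad(x, features, labels)      # cost grows with the data set size
g_batch = minibatch_grad(x, features, labels, batch_size=16, rng=rng)  # cost fixed by batch_size
```

The mini-batch gradient is an unbiased estimate of the full gradient, which is what lets it stand in for the expected loss inside each iteration.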

2. Algorithmic Structure and Bregman Divergences

The stochastic Bregman ADMM variant maintains the essential multi-block iterative nature of ADMM but crucially incorporates two modifications:

  • Stochastic sampling: At iteration $t$, only a single random example (or a mini-batch) is processed, resulting in a stochastic (sub)gradient $g_t$.
  • Adaptive Bregman Proximal: The update for each variable includes a regularization term based on a Bregman divergence $D_{\phi_t}(u,v)$, where the "prox-function" $\phi_t$ is chosen adaptively to capture the curvature witnessed over prior iterations. The general update reads:

$$x^{t+1} = \arg\min_x \left\{ f_t(x) + \langle \lambda^t, A x + B z^t - c\rangle + \rho \, D_{\phi_t}(x, x^t) \right\}$$

$$z^{t+1} = \arg\min_z \left\{ g(z) + \langle \lambda^t, A x^{t+1} + B z - c\rangle + \rho \, D_{\psi_t}(z, z^t) \right\}$$

$$\lambda^{t+1} = \lambda^t + \rho (A x^{t+1} + B z^{t+1} - c)$$

where $f_t(x)$ denotes the instantaneous loss on the sampled data at step $t$ (Zhao et al., 2013).
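The three updates can be sketched in closed form under some simplifying assumptions that are not taken from the source: $A = I$, $B = -I$, $c = 0$, $g(z) = \mu\|z\|_1$, the sampled loss $f_t$ replaced by its linearization $\langle g_t, x\rangle$, a diagonal quadratic $\phi_t$, and a Euclidean $\psi_t$. The function name and arguments are hypothetical. The sketch transcribes the updates exactly as written above; practical ADMM implementations usually also carry an augmented penalty $\tfrac{\rho}{2}\|Ax + Bz - c\|^2$, which is omitted here.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal map of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def stochastic_bregman_admm_step(x, z, lam, grad, h_diag, rho, mu):
    """One pass of the x-, z- and lambda-updates for min f(x) + mu*||z||_1 s.t. x - z = 0.

    Assumes f_t is linearized as <grad, x>, phi_t(w) = 0.5 * w^T diag(h_diag) w,
    and psi_t(w) = 0.5 * ||w||^2.
    """
    # x-update: argmin <grad + lam, x> + (rho/2) (x - x_t)^T H (x - x_t)
    x_new = x - (grad + lam) / (rho * h_diag)
    # z-update: argmin mu*||z||_1 - <lam, z> + (rho/2) ||z - z_t||^2
    z_new = soft_threshold(z + lam / rho, mu / rho)
    # Dual update on the constraint residual x - z.
    lam_new = lam + rho * (x_new - z_new)
    return x_new, z_new, lam_new

x0 = np.zeros(2)
z0 = np.zeros(2)
lam0 = np.zeros(2)
x1, z1, lam1 = stochastic_bregman_admm_step(
    x0, z0, lam0, grad=np.array([1.0, -2.0]), h_diag=np.ones(2), rho=2.0, mu=0.5
)
```

With `h_diag` set to ones, the x-update reduces to a plain gradient step of size `1 / rho`, which corresponds to the non-adaptive Euclidean case.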

The choice of Bregman divergence is crucial: while the classical quadratic form $D_\phi(u,v) = \tfrac{1}{2}\|u-v\|_2^2$ recovers Euclidean ADMM, stochastic variants adapt $\phi_t$ in a data-driven manner (diagonally or full-matrix weighted norms) to track the accumulated curvature of the trajectory.
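For reference, a Bregman divergence is defined as $D_\phi(u,v) = \phi(u) - \phi(v) - \langle \nabla\phi(v), u - v\rangle$. The short sketch below evaluates this definition for two standard choices of $\phi$; the helper names and the negative-entropy example are illustrative and are not taken from the cited paper.

```python
import numpy as np

def bregman(phi, grad_phi, u, v):
    """D_phi(u, v) = phi(u) - phi(v) - <grad phi(v), u - v>."""
    return phi(u) - phi(v) - grad_phi(v) @ (u - v)

def half_sq_norm(w):
    return 0.5 * w @ w

def half_sq_norm_grad(w):
    return w

def neg_entropy(w):
    return np.sum(w * np.log(w))

def neg_entropy_grad(w):
    return np.log(w) + 1.0

u = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(half_sq_norm, half_sq_norm_grad, u, v)  # equals 0.5 * ||u - v||^2
d_kl = bregman(neg_entropy, neg_entropy_grad, u, v)        # equals KL(u || v) on the simplex
```

The first case is the quadratic divergence of Euclidean ADMM; the second shows that a different $\phi$ induces a different geometry on the same pair of points.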

3. Adaptive Proximal Terms and Online Mirror Descent Connection

A defining property of stochastic ADMM variants is the use of optimal, history-dependent adaptive curvature. At every iteration $t$, the algorithm sets the prox-function to a quadratic norm with curvature matrix $H_t$:

$$\phi_t(w) = \tfrac{1}{2}\|w\|_{H_t}^2 \implies D_{\phi_t}(w, w') = \tfrac{1}{2}(w-w')^\top H_t (w-w')$$

where $H_t$ is chosen online to (approximately) minimize the accumulated regret relative to the sequence of observed stochastic gradients $g_1,\dots,g_t$. Typical constructions include:

  • $H_t = a I + \operatorname{diag}(s_t)$, $s_{t,i} = \|g_{1:t,i}\|_2$ (coordinate-wise adaptive)
  • $H_t = a I + G_t^{1/2}$, $G_t = \sum_{i=1}^t g_i g_i^\top$ (full-matrix, AdaGrad-like)
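The two constructions can be sketched with NumPy as follows. The function names are hypothetical, `grads` is assumed to stack the gradients $g_1,\dots,g_t$ as rows, and the matrix square root is taken through an eigendecomposition, which is one reasonable choice for a symmetric positive semidefinite $G_t$.

```python
import numpy as np

def diag_curvature(grads, a):
    """Diagonal of H_t = a*I + diag(s_t), with s_{t,i} = ||g_{1:t,i}||_2."""
    return a + np.sqrt(np.sum(np.square(grads), axis=0))

def full_curvature(grads, a):
    """H_t = a*I + G_t^{1/2}, with G_t = sum_i g_i g_i^T."""
    outer_sum = grads.T @ grads  # rows are gradients, so this sums the outer products
    eigvals, eigvecs = np.linalg.eigh(outer_sum)
    # Clip tiny negative eigenvalues caused by round-off before taking the root.
    root = (eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T
    return a * np.eye(grads.shape[1]) + root

grads = np.array([[3.0, 0.0],
                  [4.0, 0.0]])
h_diag = diag_curvature(grads, a=0.1)   # array([5.1, 0.1])
h_full = full_curvature(grads, a=0.1)   # diag(5.1, 0.1) for this example
```

In this example only the first coordinate has seen gradient mass, so it receives the larger curvature and therefore the smaller effective step.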

Such adaptive strategies ensure that the regret of the stochastic ADMM instance is never worse than the regret of the best prox chosen in hindsight, up to problem-dependent constants. This methodology matches the best possible adaptive subgradient regret bounds in the online learning literature (Zhao et al., 2013).

4. Convergence Properties and Regret Bounds

Stochastic Bregman ADMM achieves rigorous convergence and regret guarantees under convexity. The main result (Theorem 2.1 in (Zhao et al., 2013)) concerns $T$ iterations and the averaged iterates $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x^t$, $\bar{z}_T = \frac{1}{T}\sum_{t=1}^T z^t$, and bounds the expected objective gap plus constraint residual:

$$\mathbb{E}\left[ f(\bar{x}_T) + g(\bar{z}_T) - f(x^*) - g(z^*) + \rho \|A\bar{x}_T + B\bar{z}_T - c\| \right] \leq O\left( \frac{C_T}{T} \right)$$

where $C_T$ is the cumulative adaptive norm of the observed gradients. Because $C_T$ itself grows with $T$ (at most on the order of $\sqrt{T}$ for bounded gradients), the rate is $O(1/\sqrt{T})$ in the worst case, with a smaller constant when gradients are sparse or small.

Specifically, for coordinate-wise adaptive Bregman divergences, the dominant term is $\sum_i \|g_{1:T,i}\|_2$, while for fully adaptive $H_t$, it is $\operatorname{tr}(G_T^{1/2})$. In both cases, dividing this term by $T$ gives the ergodic convergence rate in the objective and constraint residual, which is $O(1/\sqrt{T})$ in the worst case for bounded gradients (Zhao et al., 2013).
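To make the two dominant terms concrete, the sketch below evaluates them on small hand-built gradient sequences. The function names and the example gradients are assumptions chosen for illustration; `grads` stacks $g_1,\dots,g_T$ as rows.

```python
import numpy as np

def diag_constant(grads):
    """sum_i ||g_{1:T,i}||_2: the sum of the column norms of the stacked gradients."""
    return np.sum(np.linalg.norm(grads, axis=0))

def full_constant(grads):
    """tr(G_T^{1/2}) with G_T = sum_t g_t g_t^T, computed from the eigenvalues of G_T."""
    eigvals = np.linalg.eigvalsh(grads.T @ grads)
    return np.sum(np.sqrt(np.clip(eigvals, 0.0, None)))

orthogonal = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
correlated = np.array([[1.0, 1.0],
                       [1.0, 1.0]])

c_diag_orth = diag_constant(orthogonal)   # 2.0
c_full_orth = full_constant(orthogonal)   # 2.0
c_diag_corr = diag_constant(correlated)   # 2 * sqrt(2)
c_full_corr = full_constant(correlated)   # 2.0

# One coordinate with constant unit gradients: the term grows like sqrt(T).
c_constant = diag_constant(np.ones((100, 1)))  # 10.0
```

When the gradients are axis-aligned the two terms coincide, while for correlated gradients the full-matrix term is smaller, which is the situation where full-matrix adaptation can pay for its extra cost.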

5. Practical Implementation and Empirical Performance

Each update of stochastic ADMM variants requires only observing a single stochastic gradient per iteration, making the per-iteration cost independent of the dataset size. The block ADMM structure ensures that updates decouple naturally if $f$ or $g$ admits a simple proximal mapping. The use of diagonal adaptive Bregman divergences is preferred in high-dimensional applications due to superior speed/memory trade-offs, while full-matrix adaptation is effective but often limited to moderate dimensions due to computational constraints (Zhao et al., 2013).
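The cost difference between the two prox strategies shows up in how the preconditioned step is applied. The sketch below assumes a linearized loss, so that the step is $x - (\rho H_t)^{-1} d$ for some direction $d$; the function names are hypothetical and the complexity figures in the comments are the usual dense linear-algebra costs, not measurements from the source.

```python
import numpy as np

def prox_step_diag(x, direction, h_diag, rho):
    """x - (rho * H)^{-1} direction for diagonal H: O(d) time and memory."""
    return x - direction / (rho * h_diag)

def prox_step_full(x, direction, h_full, rho):
    """Same step for a dense H: O(d^2) memory and an O(d^3) linear solve."""
    return x - np.linalg.solve(rho * h_full, direction)

x = np.array([1.0, -1.0, 0.5])
direction = np.array([0.2, 0.4, -0.6])
h_diag = np.array([2.0, 4.0, 1.0])

step_diag = prox_step_diag(x, direction, h_diag, rho=2.0)
step_full = prox_step_full(x, direction, np.diag(h_diag), rho=2.0)  # same result, higher cost
```

Both calls return the same point when the dense matrix happens to be diagonal; the diagonal path simply never forms or factors a $d \times d$ matrix.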

Empirical tests across datasets demonstrate that:

  • Adaptive-diagonal and full-matrix stochastic Bregman ADMM achieve substantially faster reduction in objective or feasibility gap per epoch than static-prox stochastic ADMM.
  • For instance, on the a9a dataset, two epochs of Ada-diag yield test error 15.01%, outperforming SADMM baseline at 16.46% after the same time budget (Zhao et al., 2013).
  • Convergence is less sensitive to ill-conditioning, an effect attributed to the adaptive regularization.

6. Theoretical and Practical Comparison to Deterministic and Nonadaptive Variants

Compared to classical (deterministic) batch ADMM, stochastic variants offer a direct reduction in wall-clock cost by avoiding full-data passes. Compared to nonadaptive stochastic ADMM, the use of history-adaptive Bregman divergences ensures strictly tighter regret bounds with data-dependent leading constants and empirically faster convergence for high-dimensional or non-uniform feature spaces (Zhao et al., 2013).

Theoretical guarantees match the best known rates for stochastic online optimization via mirror descent: $O(1/\sqrt{T})$ for general convexity and $O(\log T/T)$ under strong convexity.

The stochastic Bregman ADMM framework unifies ideas from stochastic mirror descent, online adaptive subgradient methods, and operator splitting. It is compatible with additional regularization and constraint structures, mini-batch variants, and can be generalized to multi-block splitting and distributed computational architectures. Related developments include Bregman-proximal ADMM for nonnegative matrix factorization, optimal transport, and large-scale distributed assignment, where the underlying ideas of stochastic sampling and adaptive Bregman regularization are integral (Chrétien et al., 2015, Zhou et al., 2022, Wang et al., 2013).

