Stochastic ADMM Variants
- Stochastic ADMM Variants are optimization methods that extend classical ADMM by using stochastic updates and adaptive Bregman divergences for data-dependent regularization.
- They scale to large machine learning and empirical risk minimization problems by processing a single sample or a mini-batch per iteration, while the adaptive proximal terms yield tighter regret bounds and faster empirical convergence.
- Practical implementations balance efficiency and accuracy through diagonal or full-matrix adaptive prox strategies, ensuring robust performance on high-dimensional and streaming data.
Stochastic ADMM variants are a class of optimization algorithms that generalize the classical Alternating Direction Method of Multipliers (ADMM) to stochastic regimes by incorporating single-sample or mini-batch updates and using adaptive Bregman divergences as proximal regularization. These variants are particularly relevant for large-scale machine learning and empirical risk minimization, where evaluating the entire empirical objective at every iteration is computationally prohibitive. By randomizing the update direction and incorporating data-adaptive curvature (via Bregman divergences tuned by the observed gradients), stochastic ADMM variants achieve improved regret bounds, faster empirical convergence, and scalability to high-dimensional and streaming data (Zhao et al., 2013).
1. Problem Setting and Motivation
Stochastic ADMM variants address problems of the form:
$$\min_{w \in \mathcal{W},\, v \in \mathcal{V}} \; f(w) + \varphi(v) \quad \text{subject to} \quad Aw + Bv = b,$$
where $f$ is typically an expectation or empirical mean over a large data set (e.g., $f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, \xi_i)$), and $\varphi$ is convex (possibly with structure-inducing constraints). The classical ADMM framework, while well established for convex problems with full-batch access, becomes impractical when evaluating the expected loss or its full gradient is computationally infeasible.
Stochastic ADMM approaches substitute the full expected loss by a randomly sampled loss term or its stochastic gradient at each iteration, mimicking stochastic gradient descent, and introduce an adaptive Bregman divergence as the regularizing proximal term, replacing the static quadratic penalty of conventional ADMM. This mechanism enables adaptive, data-dependent regularization that aligns with the geometry encountered over the stochastic sequence of gradients (Zhao et al., 2013).
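The sampling step described above can be sketched as a mini-batch gradient oracle. This is an illustrative sketch only: the logistic loss, the label convention `y in {-1, +1}`, and the function name `minibatch_logistic_gradient` are assumptions chosen for the example, not details taken from the source.

```python
import numpy as np

def minibatch_logistic_gradient(w, X, y, batch_size, rng):
    """Stochastic gradient of the mean logistic loss log(1 + exp(-y * x.w)).

    Only `batch_size` randomly chosen rows of X are touched, so the cost
    does not depend on the total number of samples.
    """
    n = X.shape[0]
    idx = rng.choice(n, size=batch_size, replace=False)  # sample without replacement
    Xb, yb = X[idx], y[idx]
    margins = yb * (Xb @ w)
    # d/dw log(1 + exp(-m)) = -y * x / (1 + exp(m)) for margin m = y * x.w
    coeff = -yb / (1.0 + np.exp(margins))
    return Xb.T @ coeff / batch_size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = np.where(rng.random(1000) < 0.5, -1.0, 1.0)
    g = minibatch_logistic_gradient(np.zeros(5), X, y, batch_size=32, rng=rng)
    print(g.shape)
```

With `batch_size` equal to the number of rows the function returns the full-batch gradient, which is a convenient way to check the oracle before plugging it into an ADMM loop.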
2. Algorithmic Structure and Bregman Divergences
The stochastic Bregman ADMM variant maintains the essential multi-block iterative nature of ADMM but crucially incorporates two modifications:
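- Stochastic sampling: At iteration $t$, only a single random example $\xi_t$ (or a mini-batch) is processed, resulting in a stochastic (sub)gradient $g_t$ of the sampled loss at the current iterate $w_t$.
- Adaptive Bregman Proximal: The update for each variable includes a regularization term based on a Bregman divergence $B_{\psi_t}(w, w_t)$, where the "prox-function" $\psi_t$ is chosen adaptively to capture the curvature witnessed over prior iterations. The general update reads:
$$w_{t+1} = \arg\min_{w \in \mathcal{W}} \left\{ \langle \nabla \ell_t(w_t),\, w \rangle + \frac{\beta}{2} \left\| Aw + Bv_t - b + u_t \right\|_2^2 + \frac{1}{\eta}\, B_{\psi_t}(w, w_t) \right\},$$
where $\ell_t$ denotes the instantaneous loss on the sampled data at step $t$, $Aw + Bv = b$ is the linear coupling constraint, $\beta > 0$ is the penalty parameter, $\eta > 0$ is the step size, and $u_t$ is the scaled dual variable (Zhao et al., 2013).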
The choice of Bregman divergence is crucial: the classical quadratic prox-function $\frac{1}{2}\|w\|_2^2$ recovers Euclidean stochastic ADMM, whereas adaptive variants update the prox-function in a data-driven manner (through diagonally or full-matrix weighted norms) to track the curvature accumulated along the trajectory.
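For a quadratic prox-function the primal update described above has a closed form, which the following sketch illustrates. It assumes a scaled-dual ADMM formulation with constraint `A w + B v = b`, a symmetric positive definite curvature matrix `H_t`, and no additional constraint set on `w`; the function names and default parameters are hypothetical and not taken from the source.

```python
import numpy as np

def bregman_w_update(w_t, v_t, u_t, g_t, H_t, A, B, b, beta=1.0, eta=0.1):
    """Minimize over w:
        <g_t, w> + (beta/2) * ||A w + B v_t - b + u_t||^2
                 + (1/(2*eta)) * (w - w_t)^T H_t (w - w_t).

    Setting the gradient to zero gives the linear system
        (beta * A^T A + H_t / eta) w = H_t w_t / eta - g_t - beta * A^T c,
    with c = B v_t - b + u_t.
    """
    c = B @ v_t - b + u_t
    lhs = beta * A.T @ A + H_t / eta
    rhs = H_t @ w_t / eta - g_t - beta * A.T @ c
    return np.linalg.solve(lhs, rhs)

def dual_update(u_t, w_next, v_next, A, B, b):
    """Scaled dual ascent step on the constraint residual."""
    return u_t + A @ w_next + B @ v_next - b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, m = 4, 2, 3
    A, B, b = rng.normal(size=(m, d)), rng.normal(size=(m, k)), rng.normal(size=m)
    w_t, v_t, u_t = rng.normal(size=d), rng.normal(size=k), np.zeros(m)
    g_t = rng.normal(size=d)
    H_t = np.eye(d)  # identity curvature corresponds to the Euclidean prox
    print(bregman_w_update(w_t, v_t, u_t, g_t, H_t, A, B, b))
```

Passing the identity for `H_t` gives the non-adaptive update; substituting a data-dependent matrix changes only the `H_t` argument, which is why the adaptive variants can reuse the same solver.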
3. Adaptive Proximal Terms and Online Mirror Descent Connection
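A defining property of stochastic ADMM variants is the use of optimal, history-dependent adaptive curvature. At every iteration $t$, the algorithm sets the prox-function as a quadratic norm with curvature matrix $H_t$:
$$\psi_t(w) = \tfrac{1}{2}\, w^\top H_t\, w, \qquad B_{\psi_t}(w, w') = \tfrac{1}{2}\,(w - w')^\top H_t\,(w - w'),$$
where $H_t$ is chosen online to (approximately) minimize accumulated regret relative to the sequence of observed stochastic gradients $g_1, \dots, g_t$. With a constant $a \ge 0$ and $g_{1:t,i} = (g_{1,i}, \dots, g_{t,i})$ denoting the history of the $i$-th gradient coordinate, typical constructions include:
- $H_t = aI + \operatorname{diag}(s_t)$, $s_{t,i} = \|g_{1:t,i}\|_2$ (coordinate-wise adaptive)
- $H_t = aI + G_t^{1/2}$, $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$ (full-matrix, AdaGrad-like)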
Such adaptive strategies ensure that the regret of the stochastic ADMM instance is never worse than the regret of the best prox chosen in hindsight, up to problem-dependent constants. This methodology matches the best possible adaptive subgradient regret bounds in the online learning literature (Zhao et al., 2013).
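The two curvature constructions can be maintained incrementally as gradients arrive. The sketch below assumes the AdaGrad-style forms `H_t = a*I + diag(sqrt(sum of squared gradients))` and `H_t = a*I + (sum of outer products)^(1/2)`; the class names and the offset parameter `a` are hypothetical choices for illustration.

```python
import numpy as np

class DiagonalCurvature:
    """Coordinate-wise curvature: stores only a length-d vector."""

    def __init__(self, dim, a=1.0):
        self.a = a
        self.sum_sq = np.zeros(dim)

    def update(self, g):
        self.sum_sq += g * g
        # Diagonal of H_t; entry i is a + ||g_{1:t,i}||_2.
        return self.a + np.sqrt(self.sum_sq)

class FullMatrixCurvature:
    """Full-matrix curvature: stores a d-by-d matrix of outer products."""

    def __init__(self, dim, a=1.0):
        self.a = a
        self.G = np.zeros((dim, dim))

    def update(self, g):
        self.G += np.outer(g, g)
        # Symmetric PSD square root via eigendecomposition; clip guards
        # against tiny negative eigenvalues from round-off.
        evals, evecs = np.linalg.eigh(self.G)
        root = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
        return self.a * np.eye(self.G.shape[0]) + root

if __name__ == "__main__":
    diag = DiagonalCurvature(dim=2)
    full = FullMatrixCurvature(dim=2)
    g = np.array([3.0, 4.0])
    print(diag.update(g))
    print(full.update(g))
```

The diagonal variant costs O(d) per update, whereas the eigendecomposition in the full-matrix variant costs O(d^3), which is the trade-off that usually decides between the two in practice.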
4. Convergence Properties and Regret Bounds
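Stochastic Bregman ADMM achieves rigorous convergence and optimal regret guarantees under convexity. The main result (Theorem 2.1 in (Zhao et al., 2013)) asserts that for $T$ iterations and averaged iterates $\bar{w}_T$, $\bar{v}_T$, one has a bound of the form:
$$\mathbb{E}\left[ f(\bar{w}_T) + \varphi(\bar{v}_T) - f(w^*) - \varphi(v^*) + \rho\, \left\| A\bar{w}_T + B\bar{v}_T - b \right\|_2 \right] = O\!\left(\frac{1}{\sqrt{T}}\right),$$
where $f$ and $\varphi$ are the loss and regularizer terms of the objective, $Aw + Bv = b$ is the coupling constraint, $(w^*, v^*)$ is an optimal solution, and $\rho > 0$ is a constant, with leading constant matching the cumulative adaptive-norm of the gradients.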
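Specifically, for coordinate-wise adaptive Bregman divergences, the dominant term is $O\!\left(\frac{1}{T}\sum_{i=1}^{d} \|g_{1:T,i}\|_2\right)$, while for fully-adaptive $H_t$, it is $O\!\left(\frac{1}{T}\operatorname{tr}\!\left(G_T^{1/2}\right)\right)$, where $g_{1:T,i}$ collects the $i$-th coordinates of the first $T$ stochastic gradients and $G_T = \sum_{t=1}^{T} g_t g_t^\top$. In both cases, the method achieves ergodic convergence in the objective and constraint residual (Zhao et al., 2013).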
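As a rough sanity check on how a coordinate-wise term of AdaGrad type relates to the $1/\sqrt{T}$ rate, assume (an illustrative assumption, not a statement from the source) that the dominant term has the form $\frac{1}{T}\sum_{i=1}^{d}\|g_{1:T,i}\|_2$ and that every gradient coordinate is bounded, $|g_{t,i}| \le G_\infty$. Then:

$$\|g_{1:T,i}\|_2 = \sqrt{\sum_{t=1}^{T} g_{t,i}^2} \;\le\; G_\infty \sqrt{T} \quad\Longrightarrow\quad \frac{1}{T}\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \;\le\; \frac{d\, G_\infty}{\sqrt{T}}.$$

If coordinate $i$ is nonzero in only $n_i \ll T$ rounds, the same argument gives $\|g_{1:T,i}\|_2 \le G_\infty \sqrt{n_i}$, so the data-dependent constant can be much smaller than the worst case when gradients are sparse.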
5. Practical Implementation and Empirical Performance
Each update of stochastic ADMM variants requires only a single stochastic gradient, making the per-iteration cost independent of the dataset size. The block ADMM structure ensures that updates decouple naturally if the loss term or the regularizer admits a simple proximal mapping. Diagonal adaptive Bregman divergences are preferred in high-dimensional applications because they need only $O(d)$ time and memory per iteration in dimension $d$, while full-matrix adaptation is effective but often limited to moderate dimensions, since it must store a $d \times d$ matrix and compute its square root (Zhao et al., 2013).
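To make the per-iteration cost concrete, here is a minimal sketch of a diagonal-adaptive stochastic ADMM loop for an l1-regularized least-squares problem with the split `w - v = 0`. The problem instance, the squared loss, the function name `ada_diag_sadmm_lasso`, and all default parameters are assumptions made for illustration; this is not the reference implementation from the cited paper.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal mapping of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ada_diag_sadmm_lasso(X, y, lam=0.01, beta=1.0, eta=1.0, a=1.0,
                         epochs=1, seed=0):
    """Sketch: min_w E[0.5 * (x.w - y)^2] + lam * ||v||_1  s.t.  w - v = 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, v, u = np.zeros(d), np.zeros(d), np.zeros(d)  # u is the scaled dual
    sum_sq = np.zeros(d)                             # running sum of g_i^2
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (X[i] @ w - y[i]) * X[i]             # one-sample gradient
            sum_sq += g * g
            h = a + np.sqrt(sum_sq)                  # diagonal of H_t
            # w-update: elementwise closed form, O(d) work
            w = (beta * (v - u) + h * w / eta - g) / (beta + h / eta)
            # v-update: proximal mapping of the l1 term
            v = soft_threshold(w + u, lam / beta)
            # dual update on the residual w - v
            u = u + w - v
    return w, v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
    X = rng.normal(size=(500, 5))
    y = X @ w_true + 0.01 * rng.normal(size=500)
    w, v = ada_diag_sadmm_lasso(X, y, epochs=2)
    print(np.round(v, 2))
```

Every operation inside the loop is elementwise on length-`d` vectors, so one iteration touches a single sample and never forms a `d`-by-`d` matrix.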
Empirical tests across datasets demonstrate that:
- Adaptive-diagonal and full-matrix stochastic Bregman ADMM achieve substantially faster reduction in objective or feasibility gap per epoch than static-prox stochastic ADMM.
- For instance, on the a9a dataset, two epochs of Ada-diag yield a test error of 15.01%, outperforming the SADMM baseline at 16.46% after the same time budget (Zhao et al., 2013).
- Convergence is largely insensitive to ill-conditioning, because the adaptive regularization rescales each coordinate by its accumulated gradient magnitude.
6. Theoretical and Practical Comparison to Deterministic and Nonadaptive Variants
Compared to classical (deterministic) batch ADMM, stochastic variants offer a direct reduction in wall-clock cost by avoiding full-data passes. Compared to nonadaptive stochastic ADMM, the use of history-adaptive Bregman divergences ensures strictly tighter regret bounds with data-dependent leading constants and empirically faster convergence for high-dimensional or non-uniform feature spaces (Zhao et al., 2013).
Theoretical guarantees match the best possible for stochastic online optimization via mirror descent: $O(1/\sqrt{T})$ for general convexity and $O(\log T / T)$ under strong convexity.
7. Extensions and Related Methodologies
The stochastic Bregman ADMM framework unifies ideas from stochastic mirror descent, online adaptive subgradient methods, and operator splitting. It is compatible with additional regularization and constraint structures and with mini-batch variants, and it can be generalized to multi-block splitting and distributed computational architectures. Related developments include Bregman-proximal ADMM for nonnegative matrix factorization, optimal transport, and large-scale distributed assignment, where the underlying ideas of stochastic sampling and adaptive Bregman regularization are integral (Chrétien et al., 2015; Zhou et al., 2022; Wang et al., 2013).
References
- "Adaptive Stochastic Alternating Direction Method of Multipliers" (Zhao et al., 2013)
- "Bregman Alternating Direction Method of Multipliers" (Wang et al., 2013)
- "A Bregman Proximal ADMM for NMF with Outliers" (Chrétien et al., 2015)
- "A Practical Distributed ADMM Solver for Billion-Scale Generalized Assignment Problems" (Zhou et al., 2022)