Randomized Coordinate Descent (RCD)
- Randomized Coordinate Descent (RCD) is an iterative optimization algorithm that updates randomly selected variable subsets to efficiently minimize differentiable functions.
- It leverages importance, block, and volume sampling techniques to reduce per-iteration complexity and exploit problem structures in high-dimensional settings.
- Enhanced variants integrate acceleration, quantized communication, and variance reduction to achieve fast convergence in convex and nonconvex regimes.
Randomized Coordinate Descent (RCD) is a class of iterative optimization algorithms in which, at each iteration, a randomly selected subset of coordinates (variables) is updated while the rest are kept fixed. RCD generalizes full-gradient methods by leveraging the sparsity, block structure, or decomposability of the underlying objective to greatly reduce per-iteration complexity. Rigorous convergence theory covers strongly convex, convex, and certain nonconvex settings, and modern variants integrate block sampling, acceleration, communication-efficiency, and specialized sampling strategies to further enhance performance.
1. General Algorithmic Framework and Theoretical Foundations
The prototypical RCD method targets minimization of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, with the variable often partitioned into blocks $x = (x^{(1)}, \dots, x^{(n)})$ for block coordinate updates. At each iteration, a random subset of coordinates (or a block) is chosen according to a probability distribution, and only those coordinates are updated using a partial gradient or another local oracle.
A unifying abstraction is the unbiased sketch framework, where at each step a diagonal random matrix $\mathbf{S}$ is sampled such that $\mathbb{E}[\mathbf{S}] = \mathbf{I}$, defining the sampling distribution via its “probability matrix” $\mathbf{P}$ with entries $\mathbf{P}_{ij} = \mathbb{E}[\mathbf{S}_{ii}\,\mathbf{S}_{jj}]$ (Szlendak et al., 2023).
The update rule is
$$x^{k+1} = x^k - \gamma\, \mathbf{S}^k \nabla f(x^k),$$
with step size $\gamma = 1/\mathcal{L}$ chosen based on a key smoothness-in-probability constant
$$\mathcal{L} := \lambda_{\max}\big(\mathbf{P} \circ \mathbf{M}\big),$$
where $\mathbf{M}$ is a smoothness matrix bounding the local quadratic model of $f$, i.e. $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2}(y-x)^\top \mathbf{M}\,(y-x)$.
The convergence properties are governed by $\mathcal{L}$ and the structure of the sketch. For single-coordinate updates (classical RCD), $\mathbf{S} = \tfrac{1}{p_i}\, e_i e_i^\top$, and coordinate $i$ is drawn with probability $p_i$.
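As a minimal sketch of the single-coordinate case, the toy example below (a hypothetical strongly convex quadratic; dimensions, seed, and iteration count are illustrative) runs classical RCD with sampling probabilities $p_i \propto L_i$ and coordinate step size $1/L_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical strongly convex quadratic f(x) = 0.5 x^T A x - b^T x:
# its smoothness matrix is A, and the coordinate-wise constants are L_i = A_ii.
d = 20
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)          # positive definite
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)   # minimizer, kept only to track the error

L = np.diag(A)                   # coordinate-wise smoothness constants
p = L / L.sum()                  # sampling probabilities p_i proportional to L_i

x = np.zeros(d)
for _ in range(20000):
    i = rng.choice(d, p=p)       # draw coordinate i with probability p_i
    g_i = A[i] @ x - b[i]        # partial derivative df/dx_i
    x[i] -= g_i / L[i]           # coordinate step with step size 1/L_i
```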
Convergence rates:
- Strongly convex: $\mathbb{E}[f(x^k)] - f^\star \le (1 - \mu/\mathcal{L})^k\,\big(f(x^0) - f^\star\big)$ for $\mu$-strongly convex $f$ (Szlendak et al., 2023).
- General convex: iteration complexity $O(\mathcal{L}/\varepsilon)$ to target function accuracy $\varepsilon$.
- Nonconvex: $O(\mathcal{L}/\varepsilon^2)$ steps to find $\varepsilon$-stationary points under matrix smoothness.
This unified framework encapsulates both the classical coordinate descent analysis [Nesterov 2012] and modern block/randomized/progressive extensions.
2. Block, Importance, and Volume Sampling Strategies
RCD can exploit various coordinate and block selection rules to accelerate convergence or handle ill-conditioned problems:
- Importance sampling: Coordinates are sampled with probabilities proportional to their coordinate-wise smoothness constants, $p_i \propto L_i$, minimizing the resulting smoothness-in-probability constant (Szlendak et al., 2023, Rodomanov et al., 2019).
- Block sampling/robust variants: Larger blocks can be sampled either uniformly or with weights that promote informative, curvature-capturing groups. Robust block coordinate descent employs partial second-order models on randomly sampled blocks, plus line search and inexactness tolerance, and has provable global and local quadratic/superlinear convergence for strongly convex problems (Fountoulakis et al., 2014).
- Volume sampling: Generalizes importance sampling to blocks of size $\tau$, selecting sets $S$ with probability proportional to $\det(\mathbf{M}_{S,S})$, the determinant of the corresponding principal submatrix of the curvature matrix. This strategy interpolates between coordinate Lipschitz sampling ($\tau = 1$) and blockwise selection, achieving accelerated convergence tied to spectral gap ratios in the smoothness matrix (Rodomanov et al., 2019).
- Progressive/block ensemble approaches: RPT (Randomized Progressive Training) randomizes the progressive growth of blocks and tunes sampling using a total cost bound based on per-block computational costs and smoothness (Szlendak et al., 2023).
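The two probability rules above can be made concrete with a small sketch (a toy $3 \times 3$ smoothness matrix, purely hypothetical): importance sampling uses $p_i \propto L_i = \mathbf{M}_{ii}$, and volume sampling for blocks of size 2 weights each block by its principal-submatrix determinant.

```python
import itertools
import numpy as np

# Toy smoothness matrix (assumed); its diagonal entries are the
# coordinate-wise Lipschitz constants L_i used by importance sampling.
M = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.2],
              [0.5, 0.2, 1.0]])
d = M.shape[0]

# Importance sampling: p_i proportional to L_i = M_ii.
p_importance = np.diag(M) / np.trace(M)

# Volume sampling for blocks of size tau = 2: probability of a block S
# proportional to det(M[S, S]), the principal-submatrix determinant.
tau = 2
blocks = list(itertools.combinations(range(d), tau))
weights = np.array([np.linalg.det(M[np.ix_(S, S)]) for S in blocks])
p_volume = weights / weights.sum()

print(dict(zip(blocks, np.round(p_volume, 3))))
```

Note how the block $\{0, 1\}$, which captures the most curvature, receives the largest volume-sampling probability.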
3. Distributed, Quantized, and Communication-Efficient RCD
In distributed settings, data and computation are spread over multiple nodes, often communicating over finite-capacity channels:
- Distributed RCD with quantized updates: Employs finite-resolution communication by quantizing partial derivatives before broadcast. Convergence is guaranteed when the quantization step is chosen below an explicit threshold (proportional to target accuracy, smoothness, and condition number), and the iteration complexity degrades only logarithmically with quantization error. This enables communication-efficient solvers for large-scale learning over bandwidth-limited networks (Gamal et al., 2016).
- Subspace-constrained variants: For linear systems with poor spectral decay, RCD performance can deteriorate due to adverse eigenvalue distributions. Subspace-constrained RCD (SC-RCD) restricts the dynamics to a lower-dimensional affine subspace defined by a (Nyström) low-rank approximation, neutralizing spectral outliers and achieving faster rates depending only on the spectral tail (Lok et al., 11 Jun 2025).
- Stability and generalization: RCD methods have been shown to possess superior (uniform and on-average) algorithmic stability properties compared to stochastic gradient descent, resulting in tighter data-dependent generalization bounds in convex and strongly convex settings (Wang et al., 2021).
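A minimal sketch of the quantized-update idea, assuming a simple deterministic uniform quantizer with a small step (the cited scheme's exact quantizer and threshold are not reproduced here; the quadratic and all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(value, step):
    """Uniform deterministic quantizer: round to the nearest multiple of `step`."""
    return step * np.round(value / step)

# Toy quadratic f(x) = 0.5 x^T A x - b^T x (assumed setup); each "node"
# broadcasts a quantized partial derivative instead of the exact value.
d = 10
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)   # kept only to track the error

L = np.diag(A)
delta = 1e-6                     # quantization step, chosen below the accuracy target
x = np.zeros(d)
for _ in range(20000):
    i = rng.integers(d)                      # uniform coordinate selection
    g_i = quantize(A[i] @ x - b[i], delta)   # finite-resolution "broadcast"
    x[i] -= g_i / L[i]
```

The quantization step `delta` caps the attainable accuracy; shrinking it recovers the exact-gradient behavior at higher communication cost.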
4. Acceleration, Spectral Augmentation, and Nonconvex Analysis
Recent advances accelerate RCD by leveraging additional directions or variance reduction techniques:
- Accelerated RCD: By maintaining auxiliary sequences mimicking Nesterov’s momentum, accelerated RCD achieves optimal accelerated rates (and linear rates under strong convexity) in stochastic and online regimes at low per-iteration cost, and strictly outperforms standard online RCD and stochastic gradient methods in practical regimes (Bhandari et al., 2018).
- Spectral and conjugate enrichment: “Stochastic Spectral/Conjugate Descent” augments the coordinate set with eigenvectors or conjugate directions, interpolating between RCD and accelerated condition-number-independent methods. This yields condition-number-dependent iteration bounds for RCD, condition-number-free bounds for SSD/SConD, and intermediate rates for hybrids (Kovalev et al., 2018). Adding a handful of spectral directions can yield orders-of-magnitude acceleration when the objective is poorly conditioned.
- Nonconvex escape and saddle-point avoidance: Almost sure avoidance of strict saddle points is established for uniformly randomized RCD under standard smoothness and nondegeneracy, via random dynamical system and center-stable manifold analysis. No explicit noise injection or step-size annealing is required; coordinate randomness suffices for global convergence to local minima (Chen et al., 11 Aug 2025).
5. Applications and Problem-Adapted Variants
RCD and its extensions power a broad variety of large-scale machine learning and optimization problems:
- Empirical Risk Minimization (ERM): Both primal and dual RCD can be theoretically and practically optimal with proper importance sampling reflecting data sparsity and structure. Whether primal RCD (which iterates over features) or dual RCD (which iterates over data points) is faster depends on the relative problem dimensions, and sparsity nuances can reverse the naive rule. Optimal sampling is computable in closed form and minimizes total cost across diverse data types (Csiba et al., 2016).
- Linear systems and kernel methods: For systems $Ax = b$, RCD with column-norm-proportional sampling enjoys linear convergence with rate $1 - \sigma_{\min}^2(A)/\|A\|_F^2$. These results extend to kernel ridge regression via Kaczmarz-style updates that never require explicit Gram matrix formation (Ramdas, 2014).
- Submodular minimization: In decomposable submodular minimization, block RCD methods outperform alternating projection (AP) techniques by requiring fewer projections and achieving faster geometric convergence; accelerated RCD halves the duality gap within a bounded number of block updates (Ene et al., 2015).
- Quantum parameterized circuit optimization: In quantum variational algorithms under stochastic gradient observability, RCD attains (in total measurement cost) the same or superior stability and efficiency as full-gradient descent, especially under coordinate-wise anisotropic smoothness, yielding cost reductions that can scale with the number of circuit parameters (Ding et al., 2023).
- Resource allocation and open multi-agent systems: RCD applied to resource allocation with coupling constraints converges linearly in expectation on complete graphs and maintains robust performance in open systems with agent arrivals and departures, subject to a threshold on the update/replacement ratio (Galland et al., 2021).
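The norm-proportional sampling rule for linear systems can be sketched as follows (a random consistent Gaussian system; dimensions and seed are illustrative). Maintaining the residual incrementally keeps each step at $O(m)$ cost:

```python
import numpy as np

rng = np.random.default_rng(2)

# Consistent system A x = b (assumed toy setup): RCD on
# f(x) = 0.5 * ||A x - b||^2 with column-norm-proportional sampling.
m, n = 50, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

col_norms_sq = (A ** 2).sum(axis=0)
p = col_norms_sq / col_norms_sq.sum()   # p_j proportional to ||A_:j||^2

x = np.zeros(n)
r = A @ x - b                            # residual, maintained incrementally
for _ in range(20000):
    j = rng.choice(n, p=p)
    step = (A[:, j] @ r) / col_norms_sq[j]   # exact minimization along e_j
    x[j] -= step
    r -= step * A[:, j]                      # O(m) residual update
```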
6. Practical Implementation, Cost Models, and Modern Variants
- Per-iteration cost: RCD has per-iteration cost ranging from $O(1)$ to $O(d)$ (often proportional to the number of nonzeros touched by the update), rendering it attractive for high-dimensional and sparse problems. For block or distributed variants, careful selection of sampling and update rules matches memory, compute, and communication constraints (Fountoulakis et al., 2014, Lok et al., 11 Jun 2025).
- Variance reduction and efficient stochastic sampling: Naive RCD-based gradient surrogates can dramatically increase variance in some sampling frameworks (e.g., Langevin Monte Carlo), which can negate computational savings. Refined variance-reduced RCD schemes, such as RCAD, restore cost efficiency in sampling scenarios (Ding et al., 2020).
- Optimized sampling for cost-sensitive settings: In settings with heterogeneous blockwise costs and smoothness, optimal RCD sampling chooses block probabilities to minimize the product of iteration complexity and expected per-step cost, balancing each block’s smoothness constant against its computational cost. This yields significant speedups in regimes with skewed block characteristics (Szlendak et al., 2023).
- Block-coupled and linearly constrained problems: For objectives with global coupling (e.g., linear constraints), RCD can be adapted to update two coordinates at a time in a feasibility-preserving subspace, with tight $O(1/k)$ and linear rates for smooth convex and strongly convex objectives, respectively (Fan et al., 2017).
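The two-coordinate, feasibility-preserving idea can be sketched for a single sum constraint $\sum_i x_i = c$ (a toy quadratic; uniform random pairing and exact line search are one simple instantiation, not necessarily the cited method's): moving along $e_i - e_j$ never changes the coordinate sum.

```python
import numpy as np

rng = np.random.default_rng(3)

# Strongly convex quadratic f(x) = 0.5 x^T A x - b^T x with the single
# coupling constraint sum(x) = c (assumed toy setup).
d = 15
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
b = rng.standard_normal(d)
c = 3.0

x = np.full(d, c / d)            # feasible start: sum(x) = c
for _ in range(30000):
    i, j = rng.choice(d, size=2, replace=False)
    g = A @ x - b                # full gradient for clarity; only g[i], g[j] are used
    curv = A[i, i] + A[j, j] - 2.0 * A[i, j]   # curvature along e_i - e_j (> 0 for PD A)
    t = -(g[i] - g[j]) / curv    # exact minimization along the feasible direction
    x[i] += t                    # opposite-sign updates preserve sum(x) = c
    x[j] -= t
```

Because every step moves along $e_i - e_j$, feasibility holds exactly at all iterations without any projection.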
7. Comparative Performance, Open Questions, and Future Directions
Recent theoretical breakthroughs prove that random-permutation coordinate descent (RPCD, sampling without replacement per epoch) strictly outperforms basic RCD (sampling with replacement) for positive-definite quadratic objectives with permutation-invariant structure. This confirms and quantifies the “reshuffling folklore” in both theory and practice (Kim et al., 29 May 2025).
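A small experiment in this spirit contrasts with-replacement (RCD) and without-replacement (RPCD) epochs on a permutation-invariant quadratic $A = (1-a)\mathbf{I} + a\,\mathbf{1}\mathbf{1}^\top$; the parameters and epoch count below are illustrative, not taken from the cited analysis.

```python
import numpy as np

rng = np.random.default_rng(4)

# Permutation-invariant positive-definite quadratic (illustrative parameters).
d, a = 30, 0.9
A = (1 - a) * np.eye(d) + a * np.ones((d, d))

def run(epochs, permute):
    """Coordinate descent with exact coordinate minimization; returns f(x) - f*."""
    x = np.ones(d)               # the minimizer of f(x) = 0.5 x^T A x is x* = 0
    for _ in range(epochs):
        # Without replacement (RPCD): a fresh permutation each epoch.
        # With replacement (RCD): d independent uniform draws each epoch.
        order = rng.permutation(d) if permute else rng.integers(d, size=d)
        for i in order:
            x[i] -= (A[i] @ x) / A[i, i]
    return 0.5 * x @ A @ x

gap_rcd = run(300, permute=False)
gap_rpcd = run(300, permute=True)
print(gap_rcd, gap_rpcd)
```

Both variants converge geometrically on this family; comparing the two printed gaps over repeated runs illustrates the reshuffling effect the cited result quantifies.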
Outstanding research questions include efficient volume/block sampling for large blocks, accelerated and “composite” versions integrating nonsmooth regularization, improved quantization and communication mechanisms for federated learning, and further robust generalization and nonconvex global convergence guarantees.
References: (Szlendak et al., 2023, Gamal et al., 2016, Rodomanov et al., 2019, Fountoulakis et al., 2014, Ene et al., 2015, Ding et al., 2023, Chen et al., 11 Aug 2025, Kim et al., 29 May 2025, Lok et al., 11 Jun 2025, Csiba et al., 2016, Bhandari et al., 2018, Kovalev et al., 2018, Wang et al., 2021, Ramdas, 2014, Fan et al., 2017, Galland et al., 2021, Ding et al., 2020).