Random Subspace Cubic-Regularization (R-ARC)
- R-ARC is a family of adaptive second-order optimization methods that restrict each iteration to a random low-dimensional subspace to reduce computational cost.
- It builds cubic models from projected gradients and Hessians or subsampled curvature, preserving convergence guarantees while reducing per-iteration cost.
- Under suitable sampling and embedding schemes, R-ARC achieves global convergence rates equivalent to full-space ARC, making it ideal for low-rank and structurally sparse problems.
Random Subspace Cubic-Regularization (R-ARC) refers to a family of adaptive second-order optimization algorithms that apply the principle of cubic regularization, but restrict each major iteration to a random low-dimensional subspace of the parameter space. Originally motivated by the prohibitive cost of full-space Hessian evaluation and subproblem solving in high dimensions, R-ARC techniques exploit randomization and/or sub-sampling to preserve the theoretical and empirical efficiency of Adaptive Regularization by Cubics (ARC) while reducing per-iteration computation. Recent research demonstrates that under appropriate sampling and embedding schemes, R-ARC achieves global convergence rates and second-order optimality guarantees equivalent to full-dimensional ARC, with substantial speedup—especially for low-rank or structurally sparse objectives (Chen et al., 2018, Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025, Zhao et al., 2024, Hanzely et al., 2020).
1. Mathematical Framework and Subspace Cubic Model
R-ARC methods address unconstrained optimization problems

$$\min_{x \in \mathbb{R}^d} f(x),$$

where $f: \mathbb{R}^d \to \mathbb{R}$ is smooth (typically $C^2$), and possibly of finite-sum structure ($f(x) = \tfrac{1}{n}\sum_{i=1}^{n} f_i(x)$). At each iteration $k$, the standard (full-space) cubic model is

$$m_k(s) = f(x_k) + \nabla f(x_k)^T s + \tfrac{1}{2}\, s^T \nabla^2 f(x_k)\, s + \tfrac{\sigma_k}{6}\|s\|^3,$$

with regularization parameter $\sigma_k > 0$. R-ARC constructs and minimizes an analogous cubic model on a random subspace or based on a random sub-sample:
- Random subspace: A sketching matrix $S_k \in \mathbb{R}^{r \times d}$ or orthonormal matrix $U_k \in \mathbb{R}^{d \times r}$ defines a subspace $\mathrm{range}(U_k) \subseteq \mathbb{R}^d$.
- Gradient and Hessian projections: Compute $g_k^r = U_k^T \nabla f(x_k)$ and $H_k^r = U_k^T \nabla^2 f(x_k)\, U_k$.
- Subspace cubic model:
  $$m_k^r(z) = f(x_k) + (g_k^r)^T z + \tfrac{1}{2}\, z^T H_k^r z + \tfrac{\sigma_k}{6}\|z\|^3.$$
- Step and acceptance: $x_{k+1} = x_k + s_k$ with $s_k = U_k z_k$, where $z_k$ is the (approximate) minimizer of $m_k^r$. Step acceptance uses a ratio of actual to predicted reduction, $\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k^r(0) - m_k^r(z_k)}$.
Alternatively, for finite-sum problems, the Hessian itself may be subsampled: for a sample $\mathcal{S}_k \subseteq \{1, \dots, n\}$, compute

$$H_k = \frac{1}{|\mathcal{S}_k|} \sum_{i \in \mathcal{S}_k} \nabla^2 f_i(x_k),$$
or use importance sampling proportional to curvature (Chen et al., 2018).
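The projections and models above can be written down directly in NumPy. This is a minimal sketch with illustrative function names (none of them come from the cited papers), using dense matrices for clarity:

```python
import numpy as np

def subspace_model_pieces(grad, hess, U):
    """Project the full gradient/Hessian onto range(U): g_r = U^T g, H_r = U^T H U."""
    g_r = U.T @ grad
    H_r = U.T @ hess @ U
    return g_r, H_r

def subsampled_hessian(hess_terms, sample_idx):
    """Uniformly subsampled Hessian for finite-sum f = (1/n) sum_i f_i:
    the average of the sampled per-term Hessians."""
    return np.mean([hess_terms[i] for i in sample_idx], axis=0)

def cubic_model(f_x, g_r, H_r, sigma, z):
    """Subspace cubic model m_k^r(z) = f(x_k) + g_r^T z + 0.5 z^T H_r z + (sigma/6)||z||^3."""
    nz = np.linalg.norm(z)
    return f_x + g_r @ z + 0.5 * z @ H_r @ z + (sigma / 6.0) * nz**3
```

Note that `cubic_model` evaluated at $z = 0$ returns $f(x_k)$, which is what makes the denominator $m_k^r(0) - m_k^r(z_k)$ in the acceptance ratio a pure model-decrease term.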
2. Hessian Approximation and Subspace Selection Strategies
R-ARC encompasses several strategies for subspace or Hessian reduction:
- Uniform coordinate subspace sampling: Sample a random block or coordinate subset to define $U_k$ from columns of the identity. The projected gradient and Hessian are then restricted to these coordinates.
- Random Gaussian or oblivious subspace embeddings: Use dense random projections (e.g., scaled Gaussian) to guarantee embedding properties, especially when seeking dimension-independence or low-rank adaptivity (Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025).
- Finite-sum subsampling: For sum-structured , directly subsample Hessian terms (uniformly or by curvature importance) (Chen et al., 2018).
Probabilistic embedding properties (Oblivious Subspace Embedding, OSE) guarantee that subspace projections preserve the relevant curvature, provided the subspace dimension $r$ is sufficiently large relative to the target rank $\hat r$ (typically $r = \mathcal{O}(\hat r)$, up to logarithmic factors).
Subspace size (or sample size) can be fixed or made adaptive. Adaptive selection monitors the observed rank of, or error in, the subspace model (e.g., updating $r_k$ if negative curvature directions are found).
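The two most common subspace constructions are a few lines each. A hedged sketch (function names are illustrative, not from the cited papers): coordinate sampling yields an exactly orthonormal $U$, while the scaled Gaussian sketch is an oblivious subspace embedding only in the probabilistic sense.

```python
import numpy as np

def coordinate_subspace(d, r, rng):
    """Uniform coordinate sampling: U consists of r distinct columns of the identity."""
    idx = rng.choice(d, size=r, replace=False)
    U = np.zeros((d, r))
    U[idx, np.arange(r)] = 1.0
    return U

def gaussian_sketch(d, r, rng):
    """Scaled dense Gaussian sketch: i.i.d. N(0, 1/r) entries, so E[U U^T] = I."""
    return rng.standard_normal((d, r)) / np.sqrt(r)
```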
3. Algorithmic Structure and Pseudocode
A prototypical R-ARC algorithm is as follows (Cartis et al., 16 Jan 2025, Zhao et al., 2024):
```
Input: x₀, σ₀, θ, subspace size r or adaptive rule, other algorithmic constants.
For k = 0, 1, 2, ...
  1. Draw a random subspace matrix U_k ∈ ℝ^{d×r}
  2. Compute g^r_k = U_k^T ∇f(x_k),  H^r_k = U_k^T ∇²f(x_k) U_k
  3. Approximately minimize
       m^r_k(z) = f(x_k) + (g^r_k)^T z + ½ z^T H^r_k z + (σ_k/6) ||z||³
     to obtain z_k; set s_k = U_k z_k
  4. Set ρ_k = (f(x_k) − f(x_k + s_k)) / (m^r_k(0) − m^r_k(z_k))
  5. If ρ_k ≥ θ:
       x_{k+1} = x_k + s_k; decrease σ_k
     Else:
       x_{k+1} = x_k; increase σ_k
  6. For adaptive variants: update subspace size r_{k+1} based on the observed
     subspace Hessian rank or a specified rule.
```
Minimum requirements on the subproblem solution are standard: sufficient model decrease (at least that achieved along the steepest-descent direction, i.e., a Cauchy point) and a subspace model-gradient norm below a specified threshold.
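The loop above can be sketched compactly in NumPy. This is a minimal illustration, not any paper's reference implementation: it assumes a Gaussian sketch for U_k, solves the subspace subproblem only to Cauchy-point accuracy (which already satisfies the minimal model-decrease requirement), and uses simple halving/doubling updates for σ_k.

```python
import numpy as np

def cauchy_point(g, H, sigma):
    """Exact minimizer of the cubic model along -g: the 1-D optimality condition
    is a quadratic in the step length t, solved in closed form."""
    gn = np.linalg.norm(g)
    if gn == 0.0:
        return np.zeros_like(g)
    a = 0.5 * sigma * gn**3          # t^2 coefficient of d/dt m(-t g) = 0
    b = g @ H @ g                     # t coefficient
    c = gn**2                         # constant term (with sign flipped)
    t = (-b + np.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)
    return -t * g

def r_arc(f, grad, hess, x0, r, sigma0=1.0, theta=0.1, iters=200, seed=0):
    """Toy R-ARC loop: sketch, project, Cauchy-point subproblem solve, ratio test."""
    rng = np.random.default_rng(seed)
    x, sigma = x0.astype(float), sigma0
    d = x.size
    for _ in range(iters):
        U = rng.standard_normal((d, r)) / np.sqrt(r)   # Gaussian sketching matrix
        g_r = U.T @ grad(x)                            # projected gradient
        H_r = U.T @ hess(x) @ U                        # projected Hessian
        z = cauchy_point(g_r, H_r, sigma)
        s = U @ z
        pred = -(g_r @ z + 0.5 * z @ H_r @ z
                 + (sigma / 6.0) * np.linalg.norm(z)**3)   # m(0) - m(z)
        if pred <= 0.0:
            sigma *= 2.0
            continue
        rho = (f(x) - f(x + s)) / pred
        if rho >= theta:                               # accept step, relax sigma
            x, sigma = x + s, max(sigma / 2.0, 1e-8)
        else:                                          # reject step, tighten sigma
            sigma *= 2.0
    return x
```

Only r Hessian-vector products and r-dimensional linear algebra appear inside the loop, which is the source of the per-iteration savings claimed above.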
For finite-sum settings, a two-phase algorithm is employed in (Chen et al., 2018): Phase I uses subsampled adaptive cubic regularization (SSAS) for a moderate-accuracy solution, Phase II applies a Nesterov-style acceleration (ASAS) for optimal complexity.
4. Convergence Theory and Global Complexity
R-ARC achieves convergence rates that interpolate between those of coordinate descent and full Newton/ARC, depending on the subspace size and sampling (Cartis et al., 16 Jan 2025, Zhao et al., 2024, Hanzely et al., 2020, Chen et al., 2018):
- First-order complexity:
  - With random subspaces of dimension $r$ satisfying OSE conditions, R-ARC finds $\|\nabla f(x_k)\| \le \epsilon$ in $\mathcal{O}(\epsilon^{-3/2})$ iterations, replicating full ARC rates while per-iteration cost is reduced to $r$ gradient/Hessian-vector products and $r$-dimensional linear algebra (Cartis et al., 16 Jan 2025, Tansley et al., 7 Jan 2025).
  - For random coordinate or minibatch selection of size $m$ in a $d$-dimensional problem, R-ARC interpolates: from the $\mathcal{O}(\epsilon^{-2})$ rate of first-order/coordinate-descent methods at small $m$, up to the $\mathcal{O}(\epsilon^{-3/2})$ rate of full cubic regularization as $m \to d$ (Zhao et al., 2024, Hanzely et al., 2020).
- Second-order complexity: To guarantee $\lambda_{\min}(\nabla^2 f(x_k)) \ge -\epsilon$, the iteration count is $\mathcal{O}(\epsilon^{-3})$, mirroring ARC (Cartis et al., 16 Jan 2025).
- High-probability and worst-case bounds: In finite-sum or inexact Hessian settings (e.g., uniform or importance sampling), accelerated variants with inexact Hessians attain $\mathcal{O}(\epsilon^{-1/3})$ global iteration complexity with high probability, with a weaker deterministic worst-case bound when the sampling conditions fail (Chen et al., 2018).
5. Adaptivity, Low-Rank Structure, and Scalability
R-ARC is particularly effective for functions with low-rank structure:
- Low-rank objectives: If $f(x) = h(Ax)$ for $A \in \mathbb{R}^{\hat r \times d}$ with $\hat r \ll d$, so that the Hessian is always of rank at most $\hat r$, then adaptive subspace variants (e.g., R-ARC-D) increase the subspace dimension only as needed, eventually matching the true rank plus one (Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025).
- Computational cost: For rank-$\hat r$ problems, per-iteration cost drops from $\mathcal{O}(d^2)$ Hessian work or $\mathcal{O}(d^3)$ subproblem factorizations (full ARC) to $\mathcal{O}(d\hat r)$ for gradient/Hessian projections plus $\mathcal{O}(\hat r^3)$ for subproblem solves. R-ARC-D discovers the intrinsic rank online, requiring neither prior knowledge nor manual tuning.
- Oblivious subspace embeddings: Use of OSEs (e.g., Gaussian sketches) ensures that key step-size and decrease properties required for ARC convergence analysis translate to the projected domain.
Empirical results show clear advantages for R-ARC in scenarios with low effective rank, strong ill-conditioning, or a prohibitive Hessian cost (Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025).
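The rank bound underlying this behavior follows directly from the chain rule: for $f(x) = h(Ax)$, $\nabla^2 f(x) = A^T \nabla^2 h(Ax)\, A$, whose rank can never exceed the number of rows of $A$. A small sketch with an illustrative (hypothetical) choice of $h$:

```python
import numpy as np

# Hypothetical low-rank objective f(x) = h(A x), with A in R^{r_hat x d}.
# By the chain rule, Hess f(x) = A^T (Hess h)(A x) A, so rank(Hess f) <= r_hat
# at every x, regardless of the ambient dimension d.
d, r_hat = 50, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((r_hat, d))

def hess_f(x):
    y = A @ x
    hess_h = np.diag(2.0 + np.cos(y))  # an illustrative positive diagonal Hessian of h
    return A.T @ hess_h @ A
```

A subspace of dimension just above $\hat r$ therefore suffices to capture all curvature, which is what R-ARC-D's adaptive rule converges to.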
6. Accelerated and Stochastic Variants
- Nesterov-style acceleration: For finite-sum objectives, phase II acceleration via mirror-descent-like auxiliary sequences achieves $\mathcal{O}(\epsilon^{-1/3})$ iteration complexity (with high probability), a clear acceleration over non-accelerated cubic regularization (Chen et al., 2018).
- Stochastic subspace and coordinate implementations: The stochastic subspace cubic Newton (SSCN) method directly applies R-ARC principles, building cubic models on randomly sampled minibatches of coordinates or dimensions, with convergence rates interpolating between first-order and full second-order methods (Hanzely et al., 2020, Zhao et al., 2024).
- Batch size and regularization parameter selection: Empirically and theoretically, single-digit-percent subspace sizes (2–5% of the coordinates) often optimize wall-clock time, balancing per-iteration cost against per-iteration progress. The regularization parameter can be selected adaptively to ensure sufficient model decrease and step acceptance.
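For coordinate minibatches, the SSCN-style step is especially cheap because projection reduces to slicing. The sketch below is a hedged illustration of one such step (the function name and the crude one-shot secular-equation solve are mine, assuming a well-conditioned coordinate block, not the papers' exact subproblem solver):

```python
import numpy as np

def sscn_step(x, grad, hess, idx, sigma):
    """One stochastic subspace cubic Newton step restricted to coordinates `idx`:
    equivalent to R-ARC with U built from identity columns, so the projected
    gradient/Hessian are just slices of the full ones."""
    g_r = grad(x)[idx]
    H_r = hess(x)[np.ix_(idx, idx)]
    # Crude subproblem solve: one fixed-point pass on lam = (sigma/2) * ||z||,
    # seeded by a (damped) Newton step on the coordinate block.
    z = np.linalg.solve(H_r + 1e-8 * np.eye(len(idx)), -g_r)
    lam = 0.5 * sigma * np.linalg.norm(z)
    z = np.linalg.solve(H_r + lam * np.eye(len(idx)), -g_r)
    x_new = x.copy()
    x_new[idx] += z
    return x_new
```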
7. Practical Implementation and Empirical Performance
- Hessian-vector products: The core computational cost is dominated by forming the $r$-dimensional projected gradient and Hessian, i.e., $r$ Hessian-vector products per iteration, which scale well with automatic differentiation and Hessian-free implementations.
- Subproblem solvers: The subspace cubic subproblem can be solved exactly (e.g., via Cholesky for small ), or inexactly via gradient-based or Krylov subspace methods (Lanczos).
- Subspace/sample size heuristics: Cap the subspace dimension between small and moderate fractions of $d$ (e.g., $0.01d$–$0.2d$) for the best trade-off, and increase it adaptively for low-rank detection. For finite-sum methods, sample complexities scale polynomially in the inverse of the desired gradient accuracy $\epsilon$.
- Numerical benchmarks: Across high-dimensional logistic regression and low-rank synthetic tasks, R-ARC and adaptive variants demonstrate substantial speed-ups (often 3–10× faster than full ARC at moderate accuracy) and outperform first-order and classical quasi-Newton methods when Hessian curvature is nontrivial or ill-conditioning is present (Chen et al., 2018, Zhao et al., 2024, Cartis et al., 16 Jan 2025, Tansley et al., 7 Jan 2025).
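For small subspace dimension $r$, the cubic subproblem can be solved essentially exactly via its secular equation. A minimal sketch (my own illustrative implementation, matching the $\sigma/6$ convention used above; the so-called hard case is deliberately ignored):

```python
import numpy as np

def solve_cubic_subproblem(g, H, sigma, tol=1e-10):
    """Global minimizer of m(z) = g^T z + 0.5 z^T H z + (sigma/6)||z||^3 for small
    dense H. Optimality: (H + lam I) z = -g with lam = (sigma/2)||z|| and
    lam >= max(0, -lambda_min(H)); the scalar lam is found by bisection on the
    secular equation (sigma/2)||z(lam)|| = lam (hard case ignored in this sketch)."""
    eigmin = np.linalg.eigvalsh(H)[0]
    lo = max(0.0, -eigmin) + 1e-12            # lam must make H + lam I PSD

    def znorm(lam):
        return np.linalg.norm(np.linalg.solve(H + lam * np.eye(len(g)), -g))

    hi = lo + 1.0
    while (sigma / 2.0) * znorm(hi) > hi:     # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):                      # bisection on the secular equation
        mid = 0.5 * (lo + hi)
        if (sigma / 2.0) * znorm(mid) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    return np.linalg.solve(H + lam * np.eye(len(g)), -g)
```

The $\mathcal{O}(r^3)$ factorizations here are cheap when $r \ll d$, which is why exact small-dimension solves are competitive with Krylov/Lanczos approaches in the subspace setting.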
Table: R-ARC Variants and Key Elements
| Variant | Subspace Mechanism | Complexity (First/Second Order) |
|---|---|---|
| ARC-D (Tansley et al., 7 Jan 2025) | Adaptive random subspace (rank-based) | O(ε^{-3/2}) / O(ε^{-3}) |
| SSCN (Zhao et al., 2024) | Random coordinate/minibatch | Interpolates O(ε^{-2}) to O(ε^{-3/2}) |
| Finite-sum (R-ARC) (Chen et al., 2018) | Hessian subsampling (uniform/importance) | O(ε^{-1/3}) (accel., high-prob.) |
| R-ARC-D (Cartis et al., 16 Jan 2025) | Adaptive sketching (OSE) | O(ε^{-3/2}) / O(ε^{-3}) |
A plausible implication is that for many large-scale or structurally low-dimensional problems, R-ARC and its descendants will become the de facto strategy to leverage second-order information, interpolating efficiently between first-order simplicity and second-order acceleration.
References
- Accelerating Adaptive Cubic Regularization of Newton’s Method via Random Sampling (Chen et al., 2018)
- Random Subspace Cubic-Regularization Methods, with Applications to Low-Rank Functions (Cartis et al., 16 Jan 2025)
- Cubic Regularized Subspace Newton for Non-Convex Optimization (Zhao et al., 2024)
- Scalable Second-Order Optimization Algorithms for Minimizing Low-rank Functions (Tansley et al., 7 Jan 2025)
- Stochastic Subspace Cubic Newton Method (Hanzely et al., 2020)