Random Subspace Cubic-Regularization (R-ARC)

Updated 12 February 2026
  • R-ARC is a family of adaptive second-order optimization methods that restrict each iteration to a random low-dimensional subspace to reduce computational cost.
  • It builds cubic models using projected gradients and Hessians or subsampled curvature, preserving convergence guarantees while accelerating per-iteration speed.
  • Under suitable sampling and embedding schemes, R-ARC achieves global convergence rates equivalent to full-space ARC, making it ideal for low-rank and structurally sparse problems.

Random Subspace Cubic-Regularization (R-ARC) refers to a family of adaptive second-order optimization algorithms that apply the principle of cubic regularization, but restrict each major iteration to a random low-dimensional subspace of the parameter space. Originally motivated by the prohibitive cost of full-space Hessian evaluation and subproblem solving in high dimensions, R-ARC techniques exploit randomization and/or sub-sampling to preserve the theoretical and empirical efficiency of Adaptive Regularization by Cubics (ARC) while reducing per-iteration computation. Recent research demonstrates that under appropriate sampling and embedding schemes, R-ARC achieves global convergence rates and second-order optimality guarantees equivalent to full-dimensional ARC, with substantial speedup—especially for low-rank or structurally sparse objectives (Chen et al., 2018, Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025, Zhao et al., 2024, Hanzely et al., 2020).

1. Mathematical Framework and Subspace Cubic Model

R-ARC methods address unconstrained optimization problems

\min_{x\in\mathbb{R}^d} f(x),

where f is smooth (typically C^2) and possibly of finite-sum structure, f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x). At each iterate x_k, the standard (full-space) cubic model is

m_k(s) = f(x_k) + \nabla f(x_k)^\top s + \frac{1}{2} s^\top \nabla^2 f(x_k) s + \frac{\sigma_k}{6}\|s\|^3,

with regularization parameter \sigma_k. R-ARC constructs and minimizes an analogous cubic model on a random subspace or based on a random sub-sample:

  • Random subspace: A sketching matrix S_k \in \mathbb{R}^{r\times d} or orthonormal matrix U_k \in \mathbb{R}^{d\times r} defines a subspace \mathcal{U}_k = \text{span}(U_k).
  • Gradient and Hessian projections: Compute \tilde{g}_k = U_k^\top \nabla f(x_k) and \tilde{H}_k = U_k^\top \nabla^2 f(x_k) U_k.
  • Subspace cubic model:

m_k^r(z) = f(x_k) + \tilde{g}_k^\top z + \frac{1}{2} z^\top \tilde{H}_k z + \frac{\sigma_k}{6}\|z\|^3, \quad z \in \mathbb{R}^r.

  • Step and acceptance: s_k = U_k z_k, where z_k is the (approximate) minimizer of m_k^r. Step acceptance uses the ratio \rho_k of actual to predicted reduction.
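These ingredients can be sketched in a few lines of NumPy (a toy quadratic stands in for f; the problem sizes and the QR-based subspace draw are illustrative assumptions, not prescribed by the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 50, 5                       # ambient and subspace dimensions (illustrative)

# Toy smooth objective f(x) = 0.5 x^T A x - b^T x with known gradient/Hessian.
M = rng.standard_normal((d, d))
A = M @ M.T / d                    # symmetric positive semidefinite
b = rng.standard_normal(d)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
hess = lambda x: A

x_k = rng.standard_normal(d)
sigma_k = 1.0

# Random orthonormal basis U_k via QR of a Gaussian matrix.
U_k, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Projected gradient and Hessian define the r-dimensional cubic model.
g_tilde = U_k.T @ grad(x_k)
H_tilde = U_k.T @ hess(x_k) @ U_k

def m_r(z):
    """Subspace cubic model m_k^r(z)."""
    return (f(x_k) + g_tilde @ z + 0.5 * z @ H_tilde @ z
            + sigma_k / 6.0 * np.linalg.norm(z) ** 3)
```

Note that only r Hessian-vector products are needed to form H_tilde, which is where the per-iteration savings come from.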

Alternatively, for finite-sum problems, the Hessian itself may be subsampled: for S_k \subset \{1,\ldots,n\}, compute

\tilde{H}_k = \frac{1}{|S_k|} \sum_{i\in S_k} \nabla^2 f_i(x_k),

or use importance sampling proportional to curvature (Chen et al., 2018).
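A minimal subsampling sketch for the finite-sum case (the least-squares terms f_i are illustrative, and using the squared row norm as a curvature proxy for importance weights is one plausible choice, not the papers' exact scheme):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 200, 10, 20              # terms, dimension, sample size (illustrative)

# Toy finite sum: f_i(x) = 0.5 (a_i^T x - y_i)^2, so each Hessian is a_i a_i^T.
A = rng.standard_normal((n, d))
hess_i = lambda i: np.outer(A[i], A[i])

# Uniform subsampling of Hessian terms.
S_k = rng.choice(n, size=m, replace=False)
H_sub = sum(hess_i(i) for i in S_k) / m

# Importance sampling proportional to a curvature proxy (here ||a_i||^2),
# with inverse-probability weights to keep the estimator unbiased.
p = np.linalg.norm(A, axis=1) ** 2
p /= p.sum()
S_imp = rng.choice(n, size=m, replace=True, p=p)
H_imp = sum(hess_i(i) / (n * p[i]) for i in S_imp) / m
```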

2. Hessian Approximation and Subspace Selection Strategies

R-ARC encompasses several strategies for subspace or Hessian reduction:

  • Uniform coordinate subspace sampling: Sample a random block or coordinate subset to define S_k. The projected gradient and Hessian are then restricted to these coordinates.
  • Random Gaussian or oblivious subspace embeddings: Use dense random projections (e.g., scaled Gaussian) to guarantee embedding properties, especially when seeking dimension-independence or low-rank adaptivity (Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025).
  • Finite-sum subsampling: For sum-structured f, directly subsample Hessian terms (uniformly or by curvature importance) (Chen et al., 2018).

Probabilistic embedding properties (Oblivious Subspace Embedding, OSE) guarantee that subspace projections preserve relevant curvature, provided the subspace dimension is sufficient (typically \mathcal{O}(r+1) for target rank r).
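The coordinate-sampling and Gaussian-embedding mechanisms can be instantiated directly (dimensions are illustrative; the 1/\sqrt{r} scaling is the standard normalization for a Gaussian sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 100, 8

# (a) Coordinate subspace sampling: U_k collects r random columns of the
# identity, so projections simply select coordinates.
idx = rng.choice(d, size=r, replace=False)
U_coord = np.eye(d)[:, idx]

# (b) Scaled Gaussian sketch (an oblivious subspace embedding): dense, but it
# preserves the geometry of any fixed low-dimensional subspace with high
# probability, independent of the problem's structure.
S_gauss = rng.standard_normal((r, d)) / np.sqrt(r)
```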

Subspace size (or sample size) can be fixed or made adaptive. Adaptive selection involves monitoring the observed rank or error in the subspace model (e.g., updating r_k if negative curvature directions are found).

3. Algorithmic Structure and Pseudocode

A prototypical R-ARC algorithm is as follows (Cartis et al., 16 Jan 2025, Zhao et al., 2024):

Input: x₀, σ₀, θ, subspace size r or adaptive rule, other algorithmic constants.

For k = 0, 1, 2, ...
    1. Draw random subspace U_k ∈ ℝ^{d×r}
    2. Compute g^r_k = U_k^T ∇f(x_k), H^r_k = U_k^T ∇^2f(x_k) U_k
    3. Approximately minimize
         m^r_k(z) = f(x_k) + (g^r_k)^T z + ½ z^T H^r_k z + (σ_k/6) ||z||^3
       to obtain z_k; set s_k = U_k z_k
    4. Set ρ_k = (f(x_k) - f(x_k + s_k)) / (m^r_k(0) - m^r_k(z_k))
    5. If ρ_k ≥ θ
           x_{k+1} = x_k + s_k; decrease σ_k
       else
           x_{k+1} = x_k; increase σ_k
    6. For adaptive variants: update subspace size r_{k+1} based on subspace Hessian rank or specified rule.

The requirements on the subproblem solution are the standard ones: sufficient model decrease, and a subspace model-gradient norm below a specified threshold.
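The loop above can be rendered compactly in NumPy. This is a sketch: the halving/doubling σ updates, the inner gradient-descent subproblem solver, and the full-gradient stopping test are simplifying assumptions, not the papers' exact rules.

```python
import numpy as np

def r_arc(f, grad, hess, x0, r=5, sigma0=1.0, theta=0.1,
          max_iter=300, tol=1e-6, seed=0):
    """Toy random-subspace cubic-regularization loop."""
    rng = np.random.default_rng(seed)
    x, sigma, d = x0.astype(float), sigma0, x0.size
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:          # full-gradient test, for simplicity
            break
        U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # 1. random subspace
        gt, Ht = U.T @ g, U.T @ hess(x) @ U                # 2. projections
        z = np.zeros(r)                                    # 3. inexact subproblem
        for _ in range(50):                                #    solve by gradient steps
            z -= 0.1 * (gt + Ht @ z + 0.5 * sigma * np.linalg.norm(z) * z)
        model_dec = -(gt @ z + 0.5 * z @ Ht @ z
                      + sigma / 6.0 * np.linalg.norm(z) ** 3)
        s = U @ z
        rho = (f(x) - f(x + s)) / max(model_dec, 1e-16)    # 4. reduction ratio
        if rho >= theta:                                   # 5. accept / reject
            x, sigma = x + s, max(sigma / 2.0, 1e-8)
        else:
            sigma *= 2.0
    return x
```

On a convex quadratic such as f(x) = 0.5‖x − 1‖² with d = 20 and r = 5, the loop converges to the minimizer despite never forming a model of dimension larger than 5, at the cost of more (but much cheaper) iterations.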

For finite-sum settings, a two-phase algorithm is employed in (Chen et al., 2018): Phase I uses subsampled adaptive cubic regularization (SSAS) for a moderate-accuracy solution, Phase II applies a Nesterov-style acceleration (ASAS) for optimal complexity.

4. Convergence Theory and Global Complexity

R-ARC achieves convergence rates that interpolate between those of coordinate descent and full Newton/ARC, depending on the subspace size and sampling (Cartis et al., 16 Jan 2025, Zhao et al., 2024, Hanzely et al., 2020, Chen et al., 2018):

  • First-order complexity:
    • With random subspaces of dimension r satisfying OSE conditions, R-ARC finds \|\nabla f(x)\| \leq \varepsilon in \mathcal{O}(\varepsilon^{-3/2}) iterations, replicating full ARC rates, while per-iteration cost is reduced to \mathcal{O}(r) gradient/Hessian-vector products and \mathcal{O}(r^3) linear algebra (Cartis et al., 16 Jan 2025, Tansley et al., 7 Jan 2025).
    • For random coordinate or minibatch selection of size m in a d-dimensional problem, R-ARC interpolates between regimes: from the \mathcal{O}(\varepsilon^{-2}) coordinate-descent rate (scaled by d/m) up to \mathcal{O}(\varepsilon^{-3/2}) as m \rightarrow d (full cubic regularization) (Zhao et al., 2024, Hanzely et al., 2020).
  • Second-order complexity: To guarantee \lambda_{\min}(\nabla^2 f(x)) \geq -\varepsilon, the iteration count is \mathcal{O}(\varepsilon^{-3}), mirroring ARC (Cartis et al., 16 Jan 2025).
  • High-probability and worst-case bounds: In finite-sum or inexact-Hessian settings (e.g., uniform or importance sampling), global iteration complexity is \mathcal{O}(\varepsilon^{-1/3}) (high probability) and \mathcal{O}(\varepsilon^{-5/6}\log\varepsilon^{-1}) in the worst case for accelerated variants with inexact Hessians (Chen et al., 2018).

5. Adaptivity, Low-Rank Structure, and Scalability

R-ARC is particularly effective for functions with low-rank structure:

  • Low-rank objectives: If f(x) = h(Ax) for A \in \mathbb{R}^{r^*\times d}, so that the Hessian always has rank at most r^*, then adaptive subspace variants (e.g. R-ARC-D) increase the subspace dimension only as needed, eventually matching the true rank plus one (Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025).
  • Computational cost: For rank-r problems, per-iteration cost drops from \mathcal{O}(d^2) or \mathcal{O}(d^3) (full ARC) to \mathcal{O}(rd) for gradient/Hessian projections and \mathcal{O}(r^3) subproblem solves. R-ARC-D discovers the intrinsic rank online, requiring neither prior knowledge nor manual tuning.
  • Oblivious subspace embeddings: Use of OSEs (e.g., Gaussian sketches) ensures that key step-size and decrease properties required for ARC convergence analysis translate to the projected domain.
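The low-rank effect is easy to verify numerically. In this sketch (with the illustrative choice h(y) = 0.5‖y‖² and assumed sizes), a random subspace of dimension r* + 1 already exposes the full curvature rank, which is exactly the signal an adaptive rule can use to stop growing r_k:

```python
import numpy as np

rng = np.random.default_rng(3)
d, r_star = 100, 3

# Low-rank objective f(x) = h(Ax) with h(y) = 0.5 ||y||^2 and A of size r* x d,
# so the Hessian A^T A has rank r* << d everywhere.
A = rng.standard_normal((r_star, d))
H = A.T @ A

# Project onto a random (r* + 1)-dimensional subspace.
U, _ = np.linalg.qr(rng.standard_normal((d, r_star + 1)))
H_proj = U.T @ H @ U
# rank(H_proj) = r* with probability 1: the projected Hessian is rank-deficient,
# revealing that the subspace is already large enough.
```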

Empirical results show clear advantages for R-ARC in scenarios with low effective rank, strong ill-conditioning, or a prohibitive Hessian cost (Tansley et al., 7 Jan 2025, Cartis et al., 16 Jan 2025).

6. Accelerated and Stochastic Variants

  • Nesterov-style acceleration: For finite-sum objectives, Phase II acceleration via mirror-descent-like auxiliary sequences achieves an \mathcal{O}(\varepsilon^{-1/3}) rate (high probability), a clear acceleration over non-accelerated cubic regularization (Chen et al., 2018).
  • Stochastic subspace and coordinate implementations: The stochastic subspace cubic Newton (SSCN) method directly applies R-ARC principles, building cubic models on randomly sampled minibatches of coordinates or dimensions, with convergence rates interpolating between first-order and full second-order methods (Hanzely et al., 2020, Zhao et al., 2024).
  • Batch size and regularization parameter selection: Empirically and theoretically, single-digit percent subspace sizes (2–5% of the coordinates) often optimize wall-clock time and trade-off between per-iteration cost and progress. The regularization parameter can be selected in an adaptive fashion to ensure sufficient model decrease and step acceptance.

7. Practical Implementation and Empirical Performance

  • Hessian-vector products: The core computational cost is dominated by r-dimensional projected-gradient and Hessian-vector products, which combine well with automatic differentiation and Hessian-free implementations.
  • Subproblem solvers: The subspace cubic subproblem can be solved exactly (e.g., via Cholesky for small rr), or inexactly via gradient-based or Krylov subspace methods (Lanczos).
  • Subspace/sample size heuristics: Cap the subspace dimension between small and moderate fractions of d (e.g., 0.01d–0.2d) for the best trade-off, and increase it adaptively for low-rank detection. For finite-sum methods, sample complexities scale with the desired gradient accuracy \varepsilon_k as \mathcal{O}(\varepsilon_k^{-2}\log d).
  • Numerical benchmarks: Across high-dimensional logistic regression and low-rank synthetic tasks, R-ARC and adaptive variants demonstrate substantial speed-ups (often 3–10× faster than full ARC at moderate accuracy) and outperform first-order and classical quasi-Newton methods when Hessian curvature is nontrivial or ill-conditioning is present (Chen et al., 2018, Zhao et al., 2024, Cartis et al., 16 Jan 2025, Tansley et al., 7 Jan 2025).
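For small r, the subspace cubic subproblem can be solved essentially exactly: stationarity requires \tilde{g} + \tilde{H}z + (\sigma/2)\|z\|z = 0, i.e. z = -(\tilde{H} + \lambda I)^{-1}\tilde{g} with the scalar \lambda = \sigma\|z\|/2 found by root-finding. A sketch via eigendecomposition and bisection (covering the easy case only, and ignoring the degenerate "hard case" familiar from trust-region theory):

```python
import numpy as np

def solve_cubic_subproblem(g, H, sigma, iters=200):
    """Minimize g^T z + 0.5 z^T H z + (sigma/6)||z||^3 for small dimensions.

    Diagonalizes H and bisects on the scalar equation
    lam = sigma * ||z(lam)|| / 2, where z(lam) = -(H + lam I)^{-1} g.
    Assumes g has a component along H's smallest eigenvector (easy case).
    """
    w, V = np.linalg.eigh(H)
    gt = V.T @ g
    znorm = lambda lam: np.linalg.norm(gt / (w + lam))
    lo = max(0.0, -w.min()) + 1e-12        # lam must make H + lam I pos. definite
    hi = lo + 1.0
    while sigma * znorm(hi) / 2.0 > hi:    # grow until the root is bracketed
        hi *= 2.0
    for _ in range(iters):                 # bisection on the scalar equation
        mid = 0.5 * (lo + hi)
        if sigma * znorm(mid) / 2.0 > mid:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return V @ (-gt / (w + lam))
```

The returned z satisfies the stationarity condition to high accuracy; for larger r, inexact Krylov (Lanczos) solvers replace the eigendecomposition.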

Table: R-ARC Variants and Key Elements

| Variant | Subspace Mechanism | Complexity (First / Second Order) |
|---|---|---|
| ARC-D (Tansley et al., 7 Jan 2025) | Adaptive random subspace (rank-based) | O(ε^{-3/2}) / O(ε^{-3}) |
| SSCN (Zhao et al., 2024) | Random coordinate/minibatch | Interpolates O(ε^{-2}) to O(ε^{-3/2}) |
| Finite-sum R-ARC (Chen et al., 2018) | Hessian subsampling (uniform/importance) | O(ε^{-1/3}) (accelerated, high-prob.) |
| R-ARC-D (Cartis et al., 16 Jan 2025) | Adaptive sketching (OSE) | O(ε^{-3/2}) / O(ε^{-3}) |

A plausible implication is that for many large-scale or structurally low-dimensional problems, R-ARC and its descendants will become the de facto strategy to leverage second-order information, interpolating efficiently between first-order simplicity and second-order acceleration.
