
Randomized Subspace Normalized SGD

Updated 29 January 2026
  • RS-NSGD is a stochastic optimization algorithm that combines random subspace selection with direction normalization to mitigate heavy-tailed gradient noise.
  • It substantially reduces memory and computation costs by operating in a lower-dimensional subspace while retaining high-probability convergence guarantees.
  • RS-NSGD achieves improved oracle complexity compared to full-dimensional methods, making it well suited to large-scale and distributed nonconvex problems.

Randomized Subspace Normalized SGD (RS-NSGD) is a stochastic optimization algorithm that integrates direction normalization into randomized subspace stochastic gradient descent, offering favorable convergence rates—especially under heavy-tailed noise conditions prevalent in modern machine learning. By combining random subspace selection with normalization, RS-NSGD demonstrates improved oracle complexity compared to full-dimensional normalized SGD and enables substantial reductions in memory and per-iteration computation cost. These properties make it particularly suitable for large-scale nonconvex optimization problems and distributed scenarios, where conventional methods face significant challenges.

1. Algorithmic Foundations and Update Rule

RS-NSGD addresses the stochastic nonconvex optimization problem

\min_{x\in\mathbb R^d} F(x) = \mathbb E_{\xi}\left[f(x;\xi)\right],

where both the parameter dimension d and the sample variability are large. At each iteration k, the algorithm proceeds as follows:

  • Sample a Haar-distributed random subspace matrix P_k \in \mathbb R^{d \times r}, with r \ll d.
  • Draw a minibatch \{\xi_k^j\}_{j=1}^{\bar B} and compute the averaged stochastic gradient g_k = \frac{1}{\bar B}\sum_{j=1}^{\bar B} \nabla f(x_k;\xi_k^j).
  • Project the gradient into the subspace: u_k = P_k^\top g_k.
  • Normalize the direction within the subspace in accordance with p-norm moment bounds on the noise (the exact normalization rule may vary by implementation).
  • Update via subspace descent: x_{k+1} = x_k - \bar\eta\, P_k u_k, with a stepsize \bar\eta > 0 that matches the smoothness-induced scaling.

This protocol leverages the statistical and computational advantages of random projections for gradient estimates, while normalization compensates for the deleterious effects of heavy-tailed noise distributions, which can dramatically skew progress in unconstrained stochastic descent.
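The iteration above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names are ours, the Haar subspace is drawn via QR of a Gaussian matrix, the normalization is assumed to be the p = 2 rule from Section 3, and the toy objective F(x) = ½‖x‖² with Student-t gradient noise stands in for a heavy-tailed problem.

```python
import numpy as np

def sample_subspace(d, r, rng):
    # Haar-distributed column space via QR of a Gaussian matrix
    G = rng.standard_normal((d, r))
    Q, _ = np.linalg.qr(G)
    return Q  # d x r with orthonormal columns

def rs_nsgd_step(x, grad_fn, r, eta, batch, rng, p=2):
    """One RS-NSGD iteration: sample subspace, average minibatch
    gradient, project, normalize, and take a subspace descent step."""
    d = x.shape[0]
    P = sample_subspace(d, r, rng)
    g = np.mean([grad_fn(x, xi) for xi in batch], axis=0)  # averaged gradient
    u = P.T @ g                                            # project to subspace
    u = u / (np.linalg.norm(u, ord=p) + 1e-12)             # normalize direction
    return x - eta * (P @ u)                               # subspace update

# Toy usage: F(x) = 0.5 ||x||^2, so grad f(x; xi) = x + xi with
# heavy-tailed (Student-t, df=3) noise draws.
rng = np.random.default_rng(0)
grad_fn = lambda x, xi: x + xi
x = np.ones(50)
for k in range(300):
    batch = [rng.standard_t(df=3, size=50) for _ in range(8)]
    x = rs_nsgd_step(x, grad_fn, r=10, eta=0.1, batch=batch, rng=rng)
```

Because each step has norm at most \bar\eta regardless of how large a noisy gradient draw is, the iterates approach the minimizer steadily even under the t-distributed noise.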

2. Noise Models, Assumptions, and High-Probability Analysis

The theoretical guarantees of RS-NSGD are predicated on mild assumptions:

  • The objective function F is L-smooth.
  • Stochastic gradients are unbiased, i.e., \mathbb E[\nabla f(x;\xi)] = \nabla F(x).
  • Noise is heavy-tailed, with bounded p-th moments for some p > 2.

Unlike prior work that focuses on convergence in expectation or under sub-Gaussian noise, RS-NSGD admits both high-probability and in-expectation convergence results even when gradient noise exhibits heavy tails. Specifically, letting \mu denote the signal fraction from random subspace projection, \Delta_0 = F(x_0) - F^*, and \sigma^2 characterize the noise scale, RS-NSGD yields the following (Theorem 3.1 under sub-Gaussian noise, extended to bounded p-th moments):

\min_{0 \le k < T} \|\nabla F(x_k)\|^2 \le \tilde O\left(\frac{d^3}{\mu^2 r}\,\Delta_0 L \sigma^2\,\varepsilon^{-4}\right)

with probability at least 1 - \delta for T \gtrsim \mu^{-1}\log(1/\delta), with smaller \varepsilon attainable via larger minibatch and rank r choices, subject to computational budget constraints (Omiya et al., 28 Jan 2026).

3. Normalization and Heavy-Tailed Noise

The direction normalization in RS-NSGD is motivated by empirical findings that stochastic gradients in large-scale machine learning are often heavy-tailed, violating sub-Gaussian hypotheses. The normalization step rescales the projected gradient so that its norm is controlled; one common prescription is to divide the projected gradient u_k by its p-norm, yielding a directionally robust step even under high noise, i.e.,

u_k^{\text{norm}} = \frac{u_k}{\|u_k\|_p}

for appropriate p, thereby attenuating the influence of outlier components. This modification is pivotal in achieving stronger concentration of the optimization trajectory and mitigating excessive variance from high-moment noise sources (Omiya et al., 28 Jan 2026).
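The effect of normalization on outlier gradients can be seen in a small numerical experiment (our illustration, assuming the p = 2 rule): a gradient draw scaled by 10^4, as a heavy-tailed distribution can occasionally produce, leaves the normalized step length unchanged, while the unnormalized step explodes.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, eta = 1000, 50, 0.1
P, _ = np.linalg.qr(rng.standard_normal((d, r)))  # Haar subspace basis

g_typical = rng.standard_normal(d)
g_spike = 1e4 * g_typical            # an outlier draw from a heavy tail

for g in (g_typical, g_spike):
    u = P.T @ g                                    # projected gradient
    step = eta * P @ (u / np.linalg.norm(u))       # normalized step (p = 2)
    raw = eta * P @ u                              # unnormalized step
    print(np.linalg.norm(step), np.linalg.norm(raw))
```

Since P has orthonormal columns, the normalized step always has length exactly \bar\eta, which is the bounded-step property underlying the high-probability analysis.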

4. Oracle Complexity and Comparative Efficiency

The oracle complexity of RS-NSGD compares favorably to both full-dimensional normalized SGD and randomized subspace SGD (RS-SGD):

  • RS-NSGD vs full-dimensional NSGD: The coordinate-oracle complexity for achieving an \varepsilon-stationary point is lower for RS-NSGD when r \ll d and \mu is not too small. Empirically, practical choices (e.g., r = d/10) yield substantial savings in memory and communication.
  • RS-NSGD vs RS-SGD: RS-NSGD demonstrates improved high-probability rates and better robustness to noise scaling; the iteration complexity is augmented only by a factor d^2/(\mu^2 r^2) in the subspace dimension, but per-iteration cost drops from O(d) to O(r) (Omiya et al., 28 Jan 2026).
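The trade-off in the second bullet can be made concrete with illustrative arithmetic using the factors quoted above (the values d = 10 000, r = 1 000, and \mu = 0.5 are assumed for illustration only; the favorable regime depends on how large \mu is for the problem at hand):

```python
# Illustrative arithmetic for the iteration-count vs per-iteration-cost
# trade-off quoted in the text (all constants dropped):
#   - iteration count is inflated by d^2 / (mu^2 r^2)
#   - per-iteration coordinate cost shrinks by r / d
d, r, mu = 10_000, 1_000, 0.5

iter_inflation = d**2 / (mu**2 * r**2)   # extra iterations needed
per_iter_ratio = r / d                   # cheaper cost per iteration
net_coord_ratio = iter_inflation * per_iter_ratio
print(iter_inflation, per_iter_ratio, net_coord_ratio)
```

Independently of this ratio, memory and per-message communication shrink by a factor of d/r, which is often the dominant practical saving in distributed deployments.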

5. Practical Guidelines and Implementation Considerations

Implementation of RS-NSGD requires tuning the following algorithmic parameters:

  • Subspace dimension r: Should reflect the effective smoothness rank of the problem; moderate fractions of d balance accuracy and efficiency.
  • Stepsize \bar\eta: The theoretical optimum is r/(dL), which cancels the smoothness penalty from projection.
  • Minibatch size \bar B: Scaling with T, or choosing larger values, reduces the noise influence and realizes the optimal high-probability scaling.
  • Direction normalization: The choice of p (commonly p = 2 or p = 4) is dictated by the observed tail behavior.

Efficient PRNG and matrix operations facilitate subspace sampling and gradient projection. When deployed in distributed or federated optimization, communication savings are realized by transmitting only the lower-dimensional projected directions, rather than full gradients.
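The communication pattern can be sketched as follows. This is a hypothetical sketch, not the paper's protocol: the function names are ours, and the key assumption is that worker and server share a PRNG seed, so the server can regenerate P_k locally and the worker transmits only the r-dimensional coefficients instead of the full d-dimensional gradient.

```python
import numpy as np

def worker_message(grad, r, seed):
    """Worker side: project the local gradient onto a seeded random
    subspace and send only r coefficients plus the seed."""
    d = grad.shape[0]
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return seed, P.T @ grad          # r + 1 numbers instead of d

def server_apply(x, msg, eta):
    """Server side: regenerate the same subspace from the seed and
    apply the normalized subspace step."""
    seed, u = msg
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((x.shape[0], u.shape[0])))
    u = u / np.linalg.norm(u)        # p = 2 direction normalization
    return x - eta * (P @ u)

d, r = 5_000, 100
rng = np.random.default_rng(42)
x = rng.standard_normal(d)
msg = worker_message(grad=x, r=r, seed=7)   # toy gradient = x
x_new = server_apply(x, msg, eta=0.1)
print(msg[1].shape)                          # r floats sent, not d
```

Here each message carries r = 100 coefficients rather than d = 5 000 gradient entries, a 50x reduction in communication per round.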

6. Theoretical and Empirical Comparison Table

| Variant | Convergence Rate | Memory/Comp Cost | Oracle Complexity |
|---|---|---|---|
| Full-dim NSGD | O(\varepsilon^{-4}) w.h.p. | O(d) | O(d\,\Delta_0 L \sigma^2\,\varepsilon^{-4}) |
| RS-SGD | O(\varepsilon^{-4}) w.h.p. | O(r) | O(\frac{d^3}{\mu^2 r}\,\Delta_0 L \sigma^2\,\varepsilon^{-4}) |
| RS-NSGD | O(\varepsilon^{-4}) w.h.p., improved for p > 2 | O(r) | Lower than RS-SGD for bounded p-th moment and non-negligible \mu |

Here, the term "oracle complexity" refers to the total number of stochastic gradient evaluations needed to reach a prescribed stationarity threshold.

7. Significance and Context Within Randomized Subspace Methods

RS-NSGD builds on a growing family of randomized subspace optimization methods seeking scalable, communication- and memory-efficient algorithms for nonconvex training of large models. While classical RS-SGD provides strong expectation-based guarantees and substantial reductions in computational load per iteration (Chen et al., 11 Feb 2025), RS-NSGD adds normalization, enabling rigorous high-probability guarantees even when the noise deviates from classical assumptions. This suggests that RS-NSGD is particularly well suited to regimes where variance or heavy tails are pronounced, such as large-batch training on real-world data.

A plausible implication is that direction normalization may become standard in future randomized and memory-efficient large-scale optimizers, especially for nonconvex objectives where heavy-tailed statistics are encountered routinely.
