
RS-NSGD: Randomized Subspace Normalized SGD

Updated 29 January 2026
  • The paper introduces RS-NSGD, which projects gradients onto random low-dimensional subspaces with normalization to ensure robust progress under heavy-tailed noise.
  • RS-NSGD reduces per-iteration computational cost from full-dimensional updates to O(r) operations while providing both in-expectation and high-probability convergence guarantees.
  • The algorithm offers oracle complexity improvements and practical guidelines for selecting stepsize, minibatch size, and subspace dimension for efficient training.

Randomized Subspace Normalized SGD (RS-NSGD) is a stochastic optimization algorithm designed to reduce per-iteration computational and memory costs in high-dimensional nonconvex learning problems by performing stochastic gradient updates within randomly chosen low-dimensional subspaces and incorporating direction normalization. This approach achieves both in-expectation and high-probability convergence guarantees, including under heavy-tailed gradient noise, and yields oracle complexity improvements over full-dimensional normalized SGD in appropriate regimes. RS-NSGD synthesizes developments from the broader class of randomized subspace stochastic gradient algorithms (RS-SGD), which project or sparsify gradients for efficiency, and advances this paradigm with normalization strategies suited to machine learning scenarios where heavy-tailed stochastic gradients are prevalent (Omiya et al., 28 Jan 2026).

1. Algorithmic Structure and Update Rule

RS-NSGD targets the general stochastic nonconvex optimization problem

\min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}_\xi[f(x; \xi)]

where $d$ is large and direct full-dimensional stochastic gradient descent (SGD) can be computationally or memory prohibitive.

At iteration kk, RS-NSGD samples:

  • A Haar-distributed random subspace basis $P_k \in \mathbb{R}^{d \times r}$, with $r \ll d$
  • A minibatch $\{\xi_k^j\}_{j=1}^{\bar{B}}$

The stochastic (minibatch-averaged) gradient is $g_k = \frac{1}{\bar{B}}\sum_{j=1}^{\bar{B}}\nabla f(x_k;\xi_k^j) \in \mathbb{R}^d$.

RS-NSGD then forms the subspace-projected gradient $u_k = P_k^\top g_k \in \mathbb{R}^r$, normalizes this direction (details in [(Omiya et al., 28 Jan 2026), §4.2]), and performs the update

x_{k+1} = x_k - \bar{\eta} \cdot \frac{P_k u_k}{\|u_k\| + \epsilon}

where $\bar{\eta}$ is a stepsize and $\epsilon$ is a regularization parameter. The normalization is crucial for robust progress, especially when the underlying gradient noise is heavy-tailed or the norm of the projected gradients varies widely, as it mitigates the risk of exploding updates or stagnation.

This structure generalizes prior randomized subspace methods by integrating explicit normalization in the subspace direction.
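The update rule above can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the authors' implementation: function and variable names are my own, and the Haar-distributed basis is drawn via QR decomposition of a Gaussian matrix (a standard construction).

```python
import numpy as np

def rs_nsgd_step(x, grad_fn, r, eta, eps, rng):
    """One RS-NSGD iteration (illustrative sketch): project the minibatch
    gradient onto a random r-dimensional subspace, normalize, and step."""
    d = x.size
    # QR of a standard Gaussian matrix yields a Haar-distributed orthonormal basis P_k.
    P, _ = np.linalg.qr(rng.standard_normal((d, r)))
    g = grad_fn(x)                    # minibatch-averaged stochastic gradient g_k
    u = P.T @ g                       # subspace projection u_k = P_k^T g_k
    return x - eta * (P @ u) / (np.linalg.norm(u) + eps)
```

Because $P_k$ has orthonormal columns, $\|P_k u_k\| = \|u_k\|$, so every step has norm at most $\bar{\eta}$ regardless of the gradient's magnitude.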

2. Theoretical Assumptions and High-Probability Guarantees

The theoretical analysis [(Omiya et al., 28 Jan 2026), §2.2–3.2] is built upon standard conditions for nonconvex stochastic optimization:

  • $F$ is lower bounded: $F_* = \inf_x F(x) > -\infty$
  • $L$-smoothness: $\|\nabla F(x) - \nabla F(y)\| \leq L\|x - y\|$
  • Unbiasedness of the stochastic gradient: $\mathbb{E}[\nabla f(x;\xi)] = \nabla F(x)$
  • Noise model: for RS-NSGD, only a bounded $p$-th moment of the gradient noise is required (less restrictive than sub-Gaussianity)
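The bounded $p$-th moment condition in the last bullet is commonly formalized as follows; this is the standard formulation from the heavy-tailed SGD literature, and the paper's exact constants and range of $p$ may differ:

```latex
\mathbb{E}\big[\|\nabla f(x;\xi) - \nabla F(x)\|^{p}\big] \le \sigma^{p},
\qquad p \in (1, 2]
```

Sub-Gaussian noise is recovered as a strictly stronger tail assumption.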

A key innovation is the derivation of non-asymptotic, high-probability convergence bounds under both sub-Gaussian and heavy-tailed noise [(Omiya et al., 28 Jan 2026), §4]. The main convergence result states: for any $\delta \in (0,1)$, with suitable choices of minibatch size $\bar{B}$ and stepsize $\bar{\eta}$, an iteration count of

T = \tilde{O}\left(\frac{d^{3}}{\mu^{2} r}\,\Delta_0 L \sigma^2 \varepsilon^{-4}\right)

suffices to guarantee $\min_{0 \leq k < T} \|\nabla F(x_k)\| \leq \varepsilon$ with probability at least $1-\delta$, where $\mu$ encodes the probability that the random subspace captures sufficient gradient energy, and $\Delta_0$ depends on the initial suboptimality. Notably, this bound matches, in order, the best-known high-probability guarantees for full-dimensional SGD under sub-Gaussian noise, but extends to heavier-tailed settings via normalization.

3. Comparison to Prior Randomized Subspace Methods

RS-NSGD shares the basic descent-in-random-subspace paradigm with established RS-SGD variants (Gressmann et al., 2020, Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025). However, previous analyses almost exclusively establish in-expectation rates, and their bounds rely on sub-Gaussian, bounded-variance noise (Chen et al., 11 Feb 2025, Gressmann et al., 2020). High-probability and robust heavy-tail results were previously unavailable or limited.

The table below synthesizes the core distinctions:

| Method | Update Rule | Convergence Guarantee | Noise Model | Memory Reduction |
|---|---|---|---|---|
| RS-SGD | $x_{k+1} = x_k - \eta P_k u_k$ | In-expectation | Sub-Gaussian | Yes |
| RS-NSGD | $x_{k+1} = x_k - \bar{\eta}\, P_k u_k/(\lVert u_k\rVert + \epsilon)$ | High-probability and in-expectation | Bounded $p$-th moment / heavy-tailed | Yes |

RS-NSGD's normalization is particularly important in the non-Gaussian, heavy-tailed regimes often encountered in deep learning. Its analysis also demonstrates oracle complexity improvements over full-dimensional Normalized SGD: each full-dimensional step incurs cost proportional to $d$, whereas an RS-NSGD step incurs cost proportional to $r$, at the expense of more iterations.
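The practical effect of normalization under heavy tails can be seen in a small simulation. This is illustrative, not from the paper: with an orthonormal basis, the RS-NSGD step norm is capped at the stepsize, while the unnormalized RS-SGD step scales with the (possibly enormous) gradient sample.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, eta, eps = 100, 10, 0.1, 1e-8

# Haar-distributed orthonormal basis via QR of a Gaussian matrix.
P, _ = np.linalg.qr(rng.standard_normal((d, r)))

# A heavy-tailed gradient sample (Student-t with df=1.5 has infinite variance),
# scaled up to mimic a rare but extreme noise draw.
g = 1e6 * rng.standard_t(df=1.5, size=d)
u = P.T @ g

step_nsgd = eta * (P @ u) / (np.linalg.norm(u) + eps)  # normalized: norm <= eta
step_sgd = eta * (P @ u)                               # unnormalized: can explode
```

Here the normalized step stays bounded by `eta` no matter how extreme the draw, which is exactly the robustness property the convergence analysis exploits.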

4. Oracle Complexity, Parameterization, and Practical Implications

The iteration complexity for achieving $\min_{k<T}\|\nabla F(x_k)\| \leq \varepsilon$ with high probability is

\tilde{O}\left(\frac{d^{3}}{\mu^{2} r}\,\Delta_0 L \sigma^2 \varepsilon^{-4}\right)

Per-iteration cost is $O(r)$ oracle calls, versus $O(d)$ for full-dimensional Normalized SGD. For moderate to large $d$ and practical $r \ll d$, overall computational cost and wall-clock time can be significantly reduced when $\mu$ is not too small (i.e., when the subspace is large enough to consistently capture the gradient signal).

Practical guidelines [(Omiya et al., 28 Jan 2026), §7] include:

  • Stepsize: $\bar{\eta} \approx r/(dL)$ to offset the smoothness penalty
  • Subspace dimension $r$: choose a moderate fraction of $d$ (e.g., $r \approx d/10$), or align with the “effective rank” of the Hessian or the empirical gradient covariance
  • Minibatch size $\bar{B}$: proportional to $T$, to ensure concentration in heavy-tailed or high-noise regimes
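These guidelines can be collected into a small helper. This is a sketch only: the proportionality constants `r_fraction` and `batch_scale` are illustrative assumptions, not values prescribed by the paper.

```python
def rs_nsgd_hyperparams(d, L, T, r_fraction=0.1, batch_scale=1.0):
    """Illustrative hyperparameter choices following the guidelines above."""
    r = max(1, int(r_fraction * d))   # subspace dimension: a moderate fraction of d
    eta = r / (d * L)                 # stepsize ~ r/(dL), offsetting the smoothness penalty
    B = max(1, int(batch_scale * T))  # minibatch size proportional to the horizon T
    return {"r": r, "eta": eta, "B": B}
```

For example, with $d = 1000$, $L = 2$, and $T = 500$, this yields $r = 100$, $\bar{\eta} = 0.05$, and $\bar{B} = 500$.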

5. Context Within Efficient Large-Scale Training

While RS-NSGD's theoretical properties are foregrounded in (Omiya et al., 28 Jan 2026), its technical lineage ties into a burgeoning ecosystem of subspace, sparsification, and quantization techniques for large-scale training.

No empirical benchmarks for RS-NSGD-specific implementations are provided in (Omiya et al., 28 Jan 2026), but practical deployments are plausible in large-batch, high-dimensional, heavy-tailed regimes where standard (full-dim) normalized SGD is unstable or inefficient.

6. Limitations, Open Challenges, and Future Directions

Several limitations and open questions remain:

  • The success of RS-NSGD depends on the subspace dimension $r$: too small an $r$ degrades convergence due to insufficient gradient signal; too large an $r$ undermines the computational savings.
  • The theoretical iteration count is inflated by factors that depend polynomially on $d/r$ and $\mu$; optimizing these via data-dependent or adaptive subspace selection is a target for future research (Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025).
  • The analysis suggests that tight high-probability bounds depend on concentration properties of the projected gradients and minibatch noise; sharpened bounds under relaxed assumptions merit further work.

Possible future improvements include adaptive rank selection, integration of second-order information in the subspace, blockwise or structured randomization to further compress memory, and empirical evaluation on heavy-tailed tasks and modern large-scale neural architectures.

