RS-NSGD: Randomized Subspace Normalized SGD
- The paper introduces RS-NSGD, which projects gradients onto random low-dimensional subspaces with normalization to ensure robust progress under heavy-tailed noise.
- RS-NSGD reduces per-iteration computational cost from full-dimensional updates to O(r) operations while providing both in-expectation and high-probability convergence guarantees.
- The algorithm offers oracle complexity improvements and practical guidelines for selecting stepsize, minibatch size, and subspace dimension for efficient training.
Randomized Subspace Normalized SGD (RS-NSGD) is a stochastic optimization algorithm designed to reduce per-iteration computational and memory costs in high-dimensional nonconvex learning problems by performing stochastic gradient updates within randomly chosen low-dimensional subspaces and incorporating direction normalization. This approach achieves both in-expectation and high-probability convergence guarantees, including under heavy-tailed gradient noise, and yields oracle complexity improvements over full-dimensional normalized SGD in appropriate regimes. RS-NSGD synthesizes developments from the broader class of randomized subspace stochastic gradient algorithms (RS-SGD), which project or sparsify gradients for efficiency, and advances this paradigm with normalization strategies suited to machine learning scenarios where heavy-tailed stochastic gradients are prevalent (Omiya et al., 28 Jan 2026).
1. Algorithmic Structure and Update Rule
RS-NSGD targets the general stochastic nonconvex optimization problem
$$\min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}_{\xi}\big[F(x, \xi)\big],$$
where the dimension $d$ is large and direct full-dimensional stochastic gradient descent (SGD) is computationally or memory prohibitive.
At iteration $t$, RS-NSGD samples:
- A Haar-distributed random subspace basis $P_t \in \mathbb{R}^{d \times r}$, with $r \ll d$
- A minibatch $B_t$ of size $b$
The stochastic (minibatch-averaged) gradient is $g_t = \frac{1}{b}\sum_{\xi \in B_t} \nabla F(x_t, \xi)$.
RS-NSGD then forms the subspace-projected direction $P_t^\top g_t$ and normalizes it (details in [(Omiya et al., 28 Jan 2026), §4.2]); the update step is
$$x_{t+1} = x_t - \eta \, \frac{P_t P_t^\top g_t}{\|P_t^\top g_t\| + \varepsilon},$$
where $\eta > 0$ is a stepsize and $\varepsilon > 0$ is a regularization parameter guarding against division by a vanishing norm. The normalization is crucial for robust progress, especially when the underlying gradient noise is heavy-tailed or the norm of projected gradients varies widely, mitigating the risk of update explosions or stagnation.
This structure generalizes prior randomized subspace methods by integrating explicit normalization in the subspace direction.
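A minimal NumPy sketch of one iteration, assuming the projected-normalized update form described above (the function name `rs_nsgd_step` and the QR-based Haar sampling are illustrative choices, not from the paper):

```python
import numpy as np

def rs_nsgd_step(x, g, r, eta, eps, rng):
    """One RS-NSGD step (sketch): project the stochastic gradient g onto a
    random r-dimensional subspace, normalize, and move with stepsize eta."""
    d = x.shape[0]
    # QR of a Gaussian matrix yields an orthonormal basis whose column space
    # is Haar-distributed over r-dimensional subspaces of R^d.
    P, _ = np.linalg.qr(rng.standard_normal((d, r)))
    u = P.T @ g                                    # projected gradient, O(dr)
    return x - eta * (P @ u) / (np.linalg.norm(u) + eps)

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose exact gradient is x itself.
rng = np.random.default_rng(0)
x = rng.standard_normal(20)
x *= 10.0 / np.linalg.norm(x)                      # start at distance 10
for _ in range(400):
    x = rs_nsgd_step(x, g=x, r=5, eta=0.1, eps=1e-8, rng=rng)
```

Because the step length is capped at $\eta$, the iterate contracts toward the minimizer at a rate governed by how much gradient energy the random subspace captures, roughly a $\sqrt{r/d}$ fraction per step in this toy problem.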
2. Theoretical Assumptions and High-Probability Guarantees
The theoretical analysis [(Omiya et al., 28 Jan 2026), §2.2–3.2] is built upon standard conditions for nonconvex stochastic optimization:
- $f$ is lower bounded ($\inf_x f(x) = f^* > -\infty$)
- $L$-smoothness: $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y$
- Unbiasedness of the stochastic gradient: $\mathbb{E}_{\xi}[\nabla F(x, \xi)] = \nabla f(x)$
- Noise model: for RS-NSGD, only a bounded $p$-th moment of the noise is required (less restrictive than sub-Gaussianity)
A key innovation is the derivation of non-asymptotic, high-probability convergence bounds under both sub-Gaussian and heavy-tailed noise [(Omiya et al., 28 Jan 2026), §4]. The main convergence result states: for any failure probability $\delta \in (0,1)$, with suitable choices of minibatch size $b$ and stepsize $\eta$, and for $T$ large enough, the minimum gradient norm $\min_{t \le T} \|\nabla f(x_t)\|$ is bounded, with probability at least $1 - \delta$, by a quantity that vanishes as $T$ grows; the bound involves a factor encoding the probability that the random subspace captures sufficient gradient energy and a term depending on the initial suboptimality $f(x_0) - f^*$. Notably, this bound matches, in terms of order, the best-known high-probability guarantees for full-dimensional SGD under sub-Gaussian noise, but extends to heavier-tailed settings via normalization.
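The role of normalization under heavy tails can be illustrated with a small simulation (a hypothetical setup, not an experiment from the paper): Student-t gradient noise with 1.5 degrees of freedom has infinite variance, so raw SGD step magnitudes occasionally explode, while normalized steps are always capped at the stepsize.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, eps = 100, 0.1, 1e-8

raw_norms, step_norms = [], []
for _ in range(2000):
    # Student-t noise with 1.5 degrees of freedom: finite p-th moment only
    # for p < 1.5, so the variance is infinite (heavy-tailed regime).
    g = rng.standard_t(df=1.5, size=d)
    raw_norms.append(np.linalg.norm(g))            # scale of a raw SGD step
    step_norms.append(np.linalg.norm(eta * g / (np.linalg.norm(g) + eps)))

# Raw gradient norms spike by orders of magnitude across draws; normalized
# steps never exceed the stepsize eta.
```

This is precisely the failure mode that a bounded $p$-th moment assumption permits and that normalization neutralizes: progress per step is controlled by direction quality, not by the (possibly unbounded) gradient magnitude.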
3. Comparison to Prior Randomized Subspace Methods
RS-NSGD shares the basic descent-in-random-subspace paradigm with established RS-SGD variants (Gressmann et al., 2020, Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025). However, previous analyses almost exclusively establish in-expectation rates, and their bounds rely on sub-Gaussian, bounded-variance noise (Chen et al., 11 Feb 2025, Gressmann et al., 2020). High-probability and robust heavy-tail results were previously unavailable or limited.
The table below synthesizes the core distinctions:
| Method | Update Rule | Convergence Guarantee | Noise Model | Memory Reduction |
|---|---|---|---|---|
| RS-SGD | $x_{t+1} = x_t - \eta P_t P_t^\top g_t$ | In-expectation | Sub-Gaussian / bounded variance | Yes |
| RS-NSGD | $x_{t+1} = x_t - \eta P_t P_t^\top g_t / (\|P_t^\top g_t\| + \varepsilon)$ | High-probability and in-expectation | Bounded $p$-th moment (heavy-tailed) | Yes |
RS-NSGD's normalization is particularly crucial for the non-Gaussian, heavy-tailed regimes often encountered in deep learning, and its analysis demonstrates oracle complexity improvements over full-dimensional Normalized SGD: each full-dimensional step incurs $O(d)$ cost, whereas an RS-NSGD step incurs $O(r)$, at the expense of more iterations.
4. Oracle Complexity, Parameterization, and Practical Implications
The iteration complexity for achieving $\min_{t \le T} \|\nabla f(x_t)\| \le \epsilon$ with high probability scales polynomially in $1/\epsilon$, with additional factors depending on $d$ and $r$. Per-iteration cost is $O(r)$ oracle calls, versus $O(d)$ for full-dimensional Normalized SGD. For moderate to large $d$ and practical $r$, overall computational cost and wall-clock time can be significantly reduced when $r$ is not too small (i.e., when the subspace is large enough to consistently capture gradient signal).
Practical guidelines [(Omiya et al., 28 Jan 2026), §7] include:
- Stepsize $\eta$: scaled down to offset the effective smoothness penalty introduced by projection
- Subspace dimension $r$: choose a moderate fraction of $d$, or align $r$ with the effective rank of the Hessian or of the empirical gradient covariance
- Minibatch size $b$: increased in heavy-tailed or high-noise regimes to ensure concentration of the minibatch gradient
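As an illustration of the second guideline, a hypothetical helper that sets $r$ from the effective rank of the empirical gradient covariance (both the heuristic and the `cap_fraction` default are assumptions for this sketch, not prescriptions from the paper):

```python
import numpy as np

def choose_subspace_dim(grad_samples, cap_fraction=0.25):
    """Heuristic: set r near the effective rank of the empirical gradient
    covariance, capped at a fraction of the ambient dimension d.
    grad_samples has shape (n, d): n observed stochastic gradients."""
    n, d = grad_samples.shape
    C = np.cov(grad_samples, rowvar=False)
    # Effective rank = trace / spectral norm; it is small when gradient
    # energy concentrates in a few directions.
    eff_rank = np.trace(C) / np.linalg.norm(C, ord=2)
    r = int(round(float(eff_rank)))
    return max(1, min(r, int(cap_fraction * d)))
```

When the observed gradients lie near a low-dimensional subspace, this returns a correspondingly small $r$ rather than a fixed fraction of $d$, which is exactly the regime where the $O(r)$-versus-$O(d)$ savings are largest.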
5. Context, Empirical Results, and Related Applications
While RS-NSGD's theoretical properties are foregrounded in (Omiya et al., 28 Jan 2026), its technical lineage ties into a burgeoning ecosystem of subspace, sparsification, or quantization techniques for large-scale training:
- RS-SGD with redraw-every-step projections significantly improves robustness and communication efficiency over fixed-projection methods (Gressmann et al., 2020)
- Memory-efficient LLM optimization via subspace methods, including RS-SGD and GrassWalk/GrassJump, routinely achieves 50–70% reductions in optimizer state and activation memory (Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025)
- Random sparsification in privacy-preserving or distributed SGD leverages similar stochastic-subsampling to reduce communication and increase adversarial robustness (Zhu et al., 2021, Saha et al., 2021)
No empirical benchmarks for RS-NSGD-specific implementations are provided in (Omiya et al., 28 Jan 2026), but practical deployments are plausible in large-batch, high-dimensional, heavy-tailed regimes where standard (full-dim) normalized SGD is unstable or inefficient.
6. Limitations, Open Challenges, and Future Directions
Several limitations and open questions remain:
- The success of RS-NSGD depends on the subspace dimension parameter $r$: choosing $r$ too small degrades convergence due to insufficient gradient signal, while choosing it too large undermines the computational savings.
- The theoretical iteration count is inflated by factors that depend polynomially on $d$ and $r$; optimizing these via data-dependent or adaptive subspace selection is a target for future research (Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025).
- The analysis suggests tight high-probability bounds depend on concentration properties of projected gradients and minibatch noise; sharpened bounds with relaxed assumptions merit further work.
Possible future improvements include adaptive rank selection, integration of second-order information in the subspace, blockwise or structured randomization to further compress memory, and empirical evaluation on heavy-tailed tasks and modern large-scale neural architectures.
Principal References:
- "Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed Noise" (Omiya et al., 28 Jan 2026)
- See also (Chen et al., 11 Feb 2025, Gressmann et al., 2020, Rajabi et al., 2 Oct 2025, Saha et al., 2021), and (Zhu et al., 2021) for related randomized subspace optimization frameworks.