RS-NSGD: Randomized Subspace Normalized SGD
- The paper introduces RS-NSGD, which projects gradients onto random low-dimensional subspaces with normalization to ensure robust progress under heavy-tailed noise.
- RS-NSGD reduces per-iteration computational cost from full-dimensional updates to O(r) operations while providing both in-expectation and high-probability convergence guarantees.
- The algorithm offers oracle complexity improvements and practical guidelines for selecting stepsize, minibatch size, and subspace dimension for efficient training.
Randomized Subspace Normalized SGD (RS-NSGD) is a stochastic optimization algorithm designed to reduce per-iteration computational and memory costs in high-dimensional nonconvex learning problems by performing stochastic gradient updates within randomly chosen low-dimensional subspaces and incorporating direction normalization. This approach achieves both in-expectation and high-probability convergence guarantees, including under heavy-tailed gradient noise, and yields oracle complexity improvements over full-dimensional normalized SGD in appropriate regimes. RS-NSGD synthesizes developments from the broader class of randomized subspace stochastic gradient algorithms (RS-SGD), which project or sparsify gradients for efficiency, and advances this paradigm with normalization strategies suited to machine learning scenarios where heavy-tailed stochastic gradients are prevalent (Omiya et al., 28 Jan 2026).
1. Algorithmic Structure and Update Rule
RS-NSGD targets the general stochastic nonconvex optimization problem
$$\min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}_{\xi}\big[F(x, \xi)\big],$$
where the dimension $d$ is large and direct full-dimensional stochastic gradient descent (SGD) is computationally or memory prohibitive.
At iteration $t$, RS-NSGD samples:
- A Haar-distributed random subspace basis $P_t \in \mathbb{R}^{d \times r}$, with $r \ll d$
- A minibatch $B_t$ of size $b$
The stochastic (minibatch-averaged) gradient is $g_t = \frac{1}{b}\sum_{\xi \in B_t} \nabla F(x_t, \xi)$.
RS-NSGD then forms the subspace-projected direction $P_t^\top g_t$ and normalizes it (details in [(Omiya et al., 28 Jan 2026), §4.2]); the update step is
$$x_{t+1} = x_t - \eta \, \frac{P_t P_t^\top g_t}{\|P_t^\top g_t\| + \varepsilon},$$
where $\eta > 0$ is a stepsize and $\varepsilon > 0$ is a regularization parameter guarding against division by a vanishing norm. The normalization is crucial for robust progress, especially when the underlying gradient noise is heavy-tailed or the norm of projected gradients varies widely, mitigating the risk of update explosions or stagnation.
This structure generalizes prior randomized subspace methods by integrating explicit normalization in the subspace direction.
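A minimal NumPy sketch of one iteration, assuming the projected-normalized update form described above (the function name `rs_nsgd_step` and the QR-based Haar sampling are illustrative choices, not from the paper):

```python
import numpy as np

def rs_nsgd_step(x, g, r, eta, eps, rng):
    """One RS-NSGD step (sketch): project the stochastic gradient g onto a
    random r-dimensional subspace, normalize, and move with stepsize eta."""
    d = x.shape[0]
    # QR of a Gaussian matrix yields an orthonormal basis whose column space
    # is Haar-distributed over r-dimensional subspaces of R^d.
    P, _ = np.linalg.qr(rng.standard_normal((d, r)))
    u = P.T @ g                                    # projected gradient, O(dr)
    return x - eta * (P @ u) / (np.linalg.norm(u) + eps)

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose exact gradient is x itself.
rng = np.random.default_rng(0)
x = rng.standard_normal(20)
x *= 10.0 / np.linalg.norm(x)                      # start at distance 10
for _ in range(400):
    x = rs_nsgd_step(x, g=x, r=5, eta=0.1, eps=1e-8, rng=rng)
```

Because the step length is capped at $\eta$, the iterate contracts toward the minimizer at a rate governed by how much gradient energy the random subspace captures, roughly a $\sqrt{r/d}$ fraction per step in this toy problem.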
2. Theoretical Assumptions and High-Probability Guarantees
The theoretical analysis [(Omiya et al., 28 Jan 2026), §2.2–3.2] is built upon standard conditions for nonconvex stochastic optimization:
- $f$ is lower bounded ($\inf_x f(x) = f^* > -\infty$)
- $L$-smoothness: $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y$
- Unbiasedness of the stochastic gradient: $\mathbb{E}_{\xi}[\nabla F(x, \xi)] = \nabla f(x)$
- Noise model: for RS-NSGD, only a bounded $p$-th moment of the noise is required (less restrictive than sub-Gaussianity)
A key innovation is the derivation of non-asymptotic, high-probability convergence bounds under both sub-Gaussian and heavy-tailed noise [(Omiya et al., 28 Jan 2026), §4]. The main convergence result states: for any failure probability $\delta \in (0,1)$, with suitable choices of minibatch size $b$ and stepsize $\eta$, and for $T$ large enough, the minimum gradient norm $\min_{t \le T} \|\nabla f(x_t)\|$ is bounded, with probability at least $1 - \delta$, by a quantity that vanishes as $T$ grows; the bound involves a factor encoding the probability that the random subspace captures sufficient gradient energy and a term depending on the initial suboptimality $f(x_0) - f^*$. Notably, this bound matches, in terms of order, the best-known high-probability guarantees for full-dimensional SGD under sub-Gaussian noise, but extends to heavier-tailed settings via normalization.
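The role of normalization under heavy tails can be illustrated with a small simulation (a hypothetical setup, not an experiment from the paper): Student-t gradient noise with 1.5 degrees of freedom has infinite variance, so raw SGD step magnitudes occasionally explode, while normalized steps are always capped at the stepsize.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, eps = 100, 0.1, 1e-8

raw_norms, step_norms = [], []
for _ in range(2000):
    # Student-t noise with 1.5 degrees of freedom: finite p-th moment only
    # for p < 1.5, so the variance is infinite (heavy-tailed regime).
    g = rng.standard_t(df=1.5, size=d)
    raw_norms.append(np.linalg.norm(g))            # scale of a raw SGD step
    step_norms.append(np.linalg.norm(eta * g / (np.linalg.norm(g) + eps)))

# Raw gradient norms spike by orders of magnitude across draws; normalized
# steps never exceed the stepsize eta.
```

This is precisely the failure mode that a bounded $p$-th moment assumption permits and that normalization neutralizes: progress per step is controlled by direction quality, not by the (possibly unbounded) gradient magnitude.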
3. Comparison to Prior Randomized Subspace Methods
RS-NSGD shares the basic descent-in-random-subspace paradigm with established RS-SGD variants (Gressmann et al., 2020, Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025). However, previous analyses almost exclusively establish in-expectation rates, and their bounds rely on sub-Gaussian, bounded-variance noise (Chen et al., 11 Feb 2025, Gressmann et al., 2020). High-probability and robust heavy-tail results were previously unavailable or limited.
The table below synthesizes the core distinctions:
| Method | Update Rule | Convergence Guarantee | Noise Model | Memory Reduction |
|---|---|---|---|---|
| RS-SGD | $x_{t+1} = x_t - \eta P_t P_t^\top g_t$ | In-expectation | Sub-Gaussian / bounded variance | Yes |
| RS-NSGD | $x_{t+1} = x_t - \eta P_t P_t^\top g_t / (\|P_t^\top g_t\| + \varepsilon)$ | High-probability and in-expectation | Bounded $p$-th moment (heavy-tailed) | Yes |
RS-NSGD's normalization is particularly crucial for the non-Gaussian, heavy-tailed regimes often encountered in deep learning, and its analysis demonstrates oracle complexity improvements over full-dimensional Normalized SGD: each full-dimensional step incurs $O(d)$ cost, whereas an RS-NSGD step incurs $O(r)$, at the expense of more iterations.
4. Oracle Complexity, Parameterization, and Practical Implications
The iteration complexity for achieving $\min_{t \le T} \|\nabla f(x_t)\| \le \epsilon$ with high probability scales polynomially in $1/\epsilon$, with additional factors depending on $d$ and $r$. Per-iteration cost is $O(r)$ oracle calls, versus $O(d)$ for full-dimensional Normalized SGD. For moderate to large $d$ and practical $r$, overall computational cost and wall-clock time can be significantly reduced when $r$ is not too small (i.e., when the subspace is large enough to consistently capture gradient signal).
Practical guidelines [(Omiya et al., 28 Jan 2026), §7] include:
- Stepsize $\eta$: scaled down to offset the effective smoothness penalty introduced by projection
- Subspace dimension $r$: choose a moderate fraction of $d$, or align $r$ with the effective rank of the Hessian or of the empirical gradient covariance
- Minibatch size $b$: increased in heavy-tailed or high-noise regimes to ensure concentration of the minibatch gradient
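As an illustration of the second guideline, a hypothetical helper that sets $r$ from the effective rank of the empirical gradient covariance (both the heuristic and the `cap_fraction` default are assumptions for this sketch, not prescriptions from the paper):

```python
import numpy as np

def choose_subspace_dim(grad_samples, cap_fraction=0.25):
    """Heuristic: set r near the effective rank of the empirical gradient
    covariance, capped at a fraction of the ambient dimension d.
    grad_samples has shape (n, d): n observed stochastic gradients."""
    n, d = grad_samples.shape
    C = np.cov(grad_samples, rowvar=False)
    # Effective rank = trace / spectral norm; it is small when gradient
    # energy concentrates in a few directions.
    eff_rank = np.trace(C) / np.linalg.norm(C, ord=2)
    r = int(round(float(eff_rank)))
    return max(1, min(r, int(cap_fraction * d)))
```

When the observed gradients lie near a low-dimensional subspace, this returns a correspondingly small $r$ rather than a fixed fraction of $d$, which is exactly the regime where the $O(r)$-versus-$O(d)$ savings are largest.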
5. Context, Empirical Results, and Related Applications
While RS-NSGD's theoretical properties are foregrounded in (Omiya et al., 28 Jan 2026), its technical lineage ties into a burgeoning ecosystem of subspace, sparsification, or quantization techniques for large-scale training:
- RS-SGD with redraw-every-step projections significantly improves robustness and communication efficiency over fixed-projection methods (Gressmann et al., 2020)
- Memory-efficient LLM optimization via subspace methods, including RS-SGD and GrassWalk/GrassJump, routinely achieves 50–70% reductions in optimizer state and activation memory (Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025)
- Random sparsification in privacy-preserving or distributed SGD leverages similar stochastic-subsampling to reduce communication and increase adversarial robustness (Zhu et al., 2021, Saha et al., 2021)
No empirical benchmarks for RS-NSGD-specific implementations are provided in (Omiya et al., 28 Jan 2026), but practical deployments are plausible in large-batch, high-dimensional, heavy-tailed regimes where standard (full-dim) normalized SGD is unstable or inefficient.
6. Limitations, Open Challenges, and Future Directions
Several limitations and open questions remain:
- The success of RS-NSGD depends on the subspace dimension parameter $r$: choosing $r$ too small degrades convergence due to insufficient gradient signal, while choosing it too large undermines the computational savings.
- The theoretical iteration count is inflated by factors that depend polynomially on $d$ and $r$; optimizing these via data-dependent or adaptive subspace selection is a target for future research (Chen et al., 11 Feb 2025, Rajabi et al., 2 Oct 2025).
- The analysis suggests tight high-probability bounds depend on concentration properties of projected gradients and minibatch noise; sharpened bounds with relaxed assumptions merit further work.
Possible future improvements include adaptive rank selection, integration of second-order information in the subspace, blockwise or structured randomization to further compress memory, and empirical evaluation on heavy-tailed tasks and modern large-scale neural architectures.
Principal References:
- "Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed Noise" (Omiya et al., 28 Jan 2026)
- See also (Chen et al., 11 Feb 2025, Gressmann et al., 2020, Rajabi et al., 2 Oct 2025, Saha et al., 2021), and (Zhu et al., 2021) for related randomized subspace optimization frameworks.