
Randomized Subspace Normalized SGD

Updated 29 January 2026
  • RS-NSGD is a stochastic optimization algorithm that combines random subspace selection with direction normalization to mitigate heavy-tailed gradient noise.
  • It substantially reduces memory and computation costs by operating in a lower-dimensional subspace while retaining high-probability convergence guarantees.
  • RS-NSGD achieves improved oracle complexity compared to full-dimensional methods, making it well suited to large-scale and distributed nonconvex problems.

Randomized Subspace Normalized SGD (RS-NSGD) is a stochastic optimization algorithm that integrates direction normalization into randomized subspace stochastic gradient descent, offering favorable convergence rates—especially under heavy-tailed noise conditions prevalent in modern machine learning. By combining random subspace selection with normalization, RS-NSGD demonstrates improved oracle complexity compared to full-dimensional normalized SGD and enables substantial reductions in memory and per-iteration computation cost. These properties make it particularly suitable for large-scale nonconvex optimization problems and distributed scenarios, where conventional methods face significant challenges.

1. Algorithmic Foundations and Update Rule

RS-NSGD addresses the stochastic nonconvex optimization problem

\min_{x\in\mathbb R^d} F(x) = \mathbb E_{\xi}\left[f(x;\xi)\right],

where both the parameter dimension d and the sample variability are large. At each iteration k, the algorithm proceeds as follows:

  • Sample a Haar-distributed random subspace matrix P_k \in \mathbb R^{d \times r}, with r \ll d.
  • Draw a minibatch \{\xi_k^j\}_{j=1}^{\bar B} and compute the averaged stochastic gradient g_k = \frac{1}{\bar B}\sum_{j=1}^{\bar B} \nabla f(x_k;\xi_k^j).
  • Project the gradient into the subspace: u_k = P_k^\top g_k.
  • Normalize the direction within the subspace in accordance with p-norm moment bounds on the noise (the exact normalization rule may vary by implementation).
  • Update via subspace descent: x_{k+1} = x_k - \bar\eta\, P_k u_k, with a stepsize \bar\eta > 0 that matches the smoothness-induced scaling.

This protocol leverages the statistical and computational advantages of random projections for gradient estimates, while normalization compensates for the deleterious effects of heavy-tailed noise distributions, which can dramatically skew progress in unconstrained stochastic descent.
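The iteration above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names are ours, the Haar subspace is drawn via QR of a Gaussian matrix, the normalization is assumed to be the p = 2 rule from Section 3, and the toy objective F(x) = ½‖x‖² with Student-t gradient noise stands in for a heavy-tailed problem.

```python
import numpy as np

def sample_subspace(d, r, rng):
    # Haar-distributed column space via QR of a Gaussian matrix
    G = rng.standard_normal((d, r))
    Q, _ = np.linalg.qr(G)
    return Q  # d x r with orthonormal columns

def rs_nsgd_step(x, grad_fn, r, eta, batch, rng, p=2):
    """One RS-NSGD iteration: sample subspace, average minibatch
    gradient, project, normalize, and take a subspace descent step."""
    d = x.shape[0]
    P = sample_subspace(d, r, rng)
    g = np.mean([grad_fn(x, xi) for xi in batch], axis=0)  # averaged gradient
    u = P.T @ g                                            # project to subspace
    u = u / (np.linalg.norm(u, ord=p) + 1e-12)             # normalize direction
    return x - eta * (P @ u)                               # subspace update

# Toy usage: F(x) = 0.5 ||x||^2, so grad f(x; xi) = x + xi with
# heavy-tailed (Student-t, df=3) noise draws.
rng = np.random.default_rng(0)
grad_fn = lambda x, xi: x + xi
x = np.ones(50)
for k in range(300):
    batch = [rng.standard_t(df=3, size=50) for _ in range(8)]
    x = rs_nsgd_step(x, grad_fn, r=10, eta=0.1, batch=batch, rng=rng)
```

Because each step has norm at most \bar\eta regardless of how large a noisy gradient draw is, the iterates approach the minimizer steadily even under the t-distributed noise.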

2. Noise Models, Assumptions, and High-Probability Analysis

The theoretical guarantees of RS-NSGD are predicated on mild assumptions:

  • The objective function F is L-smooth.
  • Stochastic gradients are unbiased, i.e., \mathbb E[\nabla f(x;\xi)] = \nabla F(x).
  • Noise is heavy-tailed, with bounded p-th moments for some p > 2.

Unlike prior work that focuses on convergence in expectation or under sub-Gaussian noise, RS-NSGD admits both high-probability and in-expectation convergence results even when gradient noise exhibits heavy tails. Specifically, letting \mu denote the signal fraction from random subspace projection, \Delta_0 = F(x_0) - F^*, and \sigma^2 characterize the noise scale, RS-NSGD yields the following (Theorem 3.1 under sub-Gaussian noise, extended to bounded p-th moments):

\min_{0 \le k < T} \|\nabla F(x_k)\|^2 \le \tilde O\left(\frac{d^3}{\mu^2 r}\,\Delta_0 L \sigma^2\,\varepsilon^{-4}\right)

with probability at least 1 - \delta for T \gtrsim \mu^{-1}\log(1/\delta), with smaller \varepsilon attainable via larger minibatch and rank r choices, subject to computational budget constraints (Omiya et al., 28 Jan 2026).

3. Normalization and Heavy-Tailed Noise

The direction normalization in RS-NSGD is motivated by empirical findings that stochastic gradients in large-scale machine learning are often heavy-tailed, violating sub-Gaussian hypotheses. The normalization step rescales the projected gradient so that its norm is controlled; one common prescription is to divide the projected gradient u_k by its p-norm, yielding a directionally robust step even under high noise, i.e.,

u_k^{\text{norm}} = \frac{u_k}{\|u_k\|_p}

for appropriate p, thereby attenuating the influence of outlier components. This modification is pivotal in achieving stronger concentration of the optimization trajectory and mitigating excessive variance from high-moment noise sources (Omiya et al., 28 Jan 2026).
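The effect of normalization on outlier gradients can be seen in a small numerical experiment (our illustration, assuming the p = 2 rule): a gradient draw scaled by 10^4, as a heavy-tailed distribution can occasionally produce, leaves the normalized step length unchanged, while the unnormalized step explodes.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, eta = 1000, 50, 0.1
P, _ = np.linalg.qr(rng.standard_normal((d, r)))  # Haar subspace basis

g_typical = rng.standard_normal(d)
g_spike = 1e4 * g_typical            # an outlier draw from a heavy tail

for g in (g_typical, g_spike):
    u = P.T @ g                                    # projected gradient
    step = eta * P @ (u / np.linalg.norm(u))       # normalized step (p = 2)
    raw = eta * P @ u                              # unnormalized step
    print(np.linalg.norm(step), np.linalg.norm(raw))
```

Since P has orthonormal columns, the normalized step always has length exactly \bar\eta, which is the bounded-step property underlying the high-probability analysis.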

4. Oracle Complexity and Comparative Efficiency

The oracle complexity of RS-NSGD compares favorably to both full-dimensional normalized SGD and randomized subspace SGD (RS-SGD):

  • RS-NSGD vs full-dimensional NSGD: The coordinate-oracle complexity for achieving an \varepsilon-stationary point is lower for RS-NSGD when r \ll d and \mu is not too small. Empirically, practical choices (e.g., r = d/10) yield substantial savings in memory and communication.
  • RS-NSGD vs RS-SGD: RS-NSGD demonstrates improved high-probability rates and better robustness to noise scaling; the iteration complexity is augmented only by a factor d^2/(\mu^2 r^2) in the subspace dimension, but per-iteration cost drops from O(d) to O(r) (Omiya et al., 28 Jan 2026).
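The trade-off in the second bullet can be made concrete with illustrative arithmetic using the factors quoted above (the values d = 10 000, r = 1 000, and \mu = 0.5 are assumed for illustration only; the favorable regime depends on how large \mu is for the problem at hand):

```python
# Illustrative arithmetic for the iteration-count vs per-iteration-cost
# trade-off quoted in the text (all constants dropped):
#   - iteration count is inflated by d^2 / (mu^2 r^2)
#   - per-iteration coordinate cost shrinks by r / d
d, r, mu = 10_000, 1_000, 0.5

iter_inflation = d**2 / (mu**2 * r**2)   # extra iterations needed
per_iter_ratio = r / d                   # cheaper cost per iteration
net_coord_ratio = iter_inflation * per_iter_ratio
print(iter_inflation, per_iter_ratio, net_coord_ratio)
```

Independently of this ratio, memory and per-message communication shrink by a factor of d/r, which is often the dominant practical saving in distributed deployments.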

5. Practical Guidelines and Implementation Considerations

Implementation of RS-NSGD requires tuning the following algorithmic parameters:

  • Subspace dimension r: Should reflect the effective smoothness rank of the problem; moderate fractions of d balance accuracy and efficiency.
  • Stepsize \bar\eta: The theoretical optimum is r/(dL), which cancels the smoothness penalty from projection.
  • Minibatch size \bar B: Scaling with T, or choosing larger values, reduces the noise influence and realizes the optimal high-probability scaling.
  • Direction normalization: The choice of p (commonly p = 2 or p = 4) is dictated by the observed tail behavior.

Efficient PRNG and matrix operations facilitate subspace sampling and gradient projection. When deployed in distributed or federated optimization, communication savings are realized by transmitting only the lower-dimensional projected directions, rather than full gradients.
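The communication pattern can be sketched as follows. This is a hypothetical sketch, not the paper's protocol: the function names are ours, and the key assumption is that worker and server share a PRNG seed, so the server can regenerate P_k locally and the worker transmits only the r-dimensional coefficients instead of the full d-dimensional gradient.

```python
import numpy as np

def worker_message(grad, r, seed):
    """Worker side: project the local gradient onto a seeded random
    subspace and send only r coefficients plus the seed."""
    d = grad.shape[0]
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return seed, P.T @ grad          # r + 1 numbers instead of d

def server_apply(x, msg, eta):
    """Server side: regenerate the same subspace from the seed and
    apply the normalized subspace step."""
    seed, u = msg
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((x.shape[0], u.shape[0])))
    u = u / np.linalg.norm(u)        # p = 2 direction normalization
    return x - eta * (P @ u)

d, r = 5_000, 100
rng = np.random.default_rng(42)
x = rng.standard_normal(d)
msg = worker_message(grad=x, r=r, seed=7)   # toy gradient = x
x_new = server_apply(x, msg, eta=0.1)
print(msg[1].shape)                          # r floats sent, not d
```

Here each message carries r = 100 coefficients rather than d = 5 000 gradient entries, a 50x reduction in communication per round.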

6. Theoretical and Empirical Comparison Table

| Variant | Convergence Rate | Memory/Comp Cost | Oracle Complexity |
|---|---|---|---|
| Full-dim NSGD | O(\varepsilon^{-4}) w.h.p. | O(d) | O(d\,\Delta_0 L \sigma^2\,\varepsilon^{-4}) |
| RS-SGD | O(\varepsilon^{-4}) w.h.p. | O(r) | O(\frac{d^3}{\mu^2 r}\,\Delta_0 L \sigma^2\,\varepsilon^{-4}) |
| RS-NSGD | O(\varepsilon^{-4}) w.h.p., improved for p > 2 | O(r) | Lower than RS-SGD for bounded p-th moment and non-negligible \mu |

Here, the term "oracle complexity" refers to the total number of stochastic gradient evaluations needed to reach a prescribed stationarity threshold.

7. Significance and Context Within Randomized Subspace Methods

RS-NSGD builds on a growing family of randomized subspace optimization methods seeking scalable, communication- and memory-efficient algorithms for nonconvex training of large models. While classical RS-SGD provides strong expectation-based guarantees and substantial reductions in computational load per iteration (Chen et al., 11 Feb 2025), RS-NSGD adds normalization, enabling rigorous high-probability guarantees even when the noise deviates from classical assumptions. This suggests that RS-NSGD is particularly well suited to regimes where variance or heavy tails are pronounced, such as large-batch training on real-world data.

A plausible implication is that direction normalization may become standard in future randomized and memory-efficient large-scale optimizers, especially for nonconvex objectives where heavy-tailed statistics are encountered routinely.
