Muon-NSGD: Matrix-Aware Gradient Optimizer

Updated 18 February 2026
  • Muon-NSGD is a matrix-aware, non-Euclidean optimizer that projects gradient updates onto the Stiefel manifold using an efficient Newton–Schulz approximation.
  • It replaces traditional Euclidean updates with spectral norm–constrained steps, enhancing stability and performance in large-batch transformer training.
  • The optimizer provides implicit spectral regularization and rigorous theoretical guarantees, yielding faster convergence and compute efficiency compared to adaptive methods.

Muon-NSGD (Natural Spectral Gradient Descent) is a matrix-aware, non-Euclidean gradient-based optimizer designed for efficient, stable training of neural networks—particularly transformer architectures—by leveraging geometric properties of matrix-structured parameters. The method replaces traditional Euclidean updates with steps adapted to the spectral structure of matrix weights, projecting gradient-based momenta onto the Stiefel manifold via the matrix-sign function, and employs efficient Newton–Schulz-based approximations to avoid explicit singular value decompositions. Muon-NSGD has demonstrated strong empirical performance in large-batch regimes and when combined with contemporary architectural components such as Mixture-of-Experts (MoE) and Multi-Head Latent Attention (MLA), and is supported by rigorous theoretical guarantees of stability, implicit spectral regularization, and convergence under modern nonconvex settings (Mehta et al., 29 Sep 2025).

1. Fundamental Algorithmic Structure

Muon-NSGD constructs its update by applying a momentum-smoothed gradient transformation and projecting the resulting matrix onto the Stiefel manifold, thus ensuring the step direction is semi-orthogonal with unit spectral norm. Mathematically, for a matrix parameter $W_t$ at layer $\ell$, the iteration proceeds:

  • Momentum update:

M_t = \beta M_{t-1} + (1-\beta) G_t

where $G_t = \nabla_{W_\ell} \mathcal{L}(W_t)$ is the stochastic gradient and $\beta$ is the momentum parameter.

  • Matrix-sign projection (orthogonalization):

U_t = \mathrm{msign}(M_t), \qquad \mathrm{msign}(M) = U_{:,1:r} V_{:,1:r}^\top = M(M^\top M)^{-1/2}

where $M = U \Sigma V^\top$ is the SVD and $r$ is the rank. This operation enforces that all singular values of $U_t$ equal 1, guaranteeing $\|U_t\|_2 = 1$.

  • Parameter update (with weight decay):

W_{t+1} = W_t - \eta_t (U_t + \lambda W_t)

with learning rate $\eta_t$ and weight decay $\lambda$.

This process performs a natural gradient step on the Stiefel manifold and acts as a steepest descent step under the spectral norm (Mehta et al., 29 Sep 2025, Li et al., 5 Feb 2025).
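The three steps above can be sketched in NumPy, using an exact SVD-based msign for clarity (in practice the Newton–Schulz approximation of Section 2 replaces it). Function names and default values here are illustrative, not the reference implementation:

```python
import numpy as np

def msign(M):
    """Matrix sign via SVD: returns U V^T, flattening all singular values to 1."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_nsgd_step(W, G, M, beta=0.9, lr=1e-3, weight_decay=0.05):
    """One Muon-NSGD iteration: momentum smoothing, msign projection,
    and a decoupled weight-decay parameter update."""
    M = beta * M + (1 - beta) * G        # momentum update
    U = msign(M)                         # orthogonal step direction, ||U||_2 = 1
    W = W - lr * (U + weight_decay * W)  # parameter update with weight decay
    return W, M
```

Because msign discards singular-value magnitudes, scaling the momentum (e.g. by $1-\beta$) leaves the step direction unchanged; only the learning rate controls step length.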

2. Efficient Computation: Newton–Schulz Orthogonalization

Direct computation of the matrix sign via full SVD is computationally prohibitive. Instead, Muon-NSGD employs a truncated Newton–Schulz iteration to approximate $M_t(M_t^\top M_t)^{-1/2}$. The iteration with coefficients $(a, b, c)$ (optimized to control the contraction of singular values) is

X_0 = M_t / \|M_t\|_F

X_k = a X_{k-1} + b X_{k-1}(X_{k-1}^\top X_{k-1}) + c X_{k-1}(X_{k-1}^\top X_{k-1})^2

for a small fixed $K$ (typically $K=5$) and coefficients $(a,b,c) = (3.4445, -4.7750, 2.0315)$. The result is re-scaled (e.g., by $s = \rho \sqrt{n}$ with $\rho = 0.2$) to match the norm statistics of AdamW's step, mitigating instability during early training (Mehta et al., 29 Sep 2025, Kim et al., 27 Jan 2026).
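A minimal NumPy sketch of this iteration with the stated coefficients (the AdamW-matching re-scaling step is omitted; the function name is illustrative):

```python
import numpy as np

# Quintic Newton–Schulz coefficients from the text
A, B, C = 3.4445, -4.7750, 2.0315

def newton_schulz_msign(M, steps=5):
    """Approximate msign(M) = M (M^T M)^{-1/2} with K polynomial steps,
    driving every singular value of X toward 1 without computing an SVD."""
    X = M / np.linalg.norm(M)  # Frobenius normalization: X_0
    for _ in range(steps):
        S = X.T @ X
        X = A * X + B * (X @ S) + C * (X @ S @ S)
    return X
```

With these coefficients the singular values oscillate in a band around 1 rather than converging exactly; this matches Muon practice, where an approximately flat spectrum suffices.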

This approach achieves near-optimal orthogonalization and removes the $\sqrt{\text{rank}}$ penalty characteristic of vector-based momentum methods, while avoiding the overhead of SVD, with doubly-exponential contraction of the approximation error as the number and degree of Newton–Schulz steps increase. In practice, $q=2$, $\kappa=2$ yields negligible deviation from exact polar iteration while maintaining high efficiency (Kim et al., 27 Jan 2026).

3. Theoretical Guarantees and Spectral Regularization

Muon-NSGD imposes implicit spectral regularization through its update, stabilizing the optimization trajectory by flattening the gradient spectrum. The projection step solves

U_t = \arg\min_{U : \sigma_i(U) = 1} \|U - M_t\|_F^2

which ensures that no single direction in the parameter space is allowed to dominate, directly countering gradient explosion. The optimizer is additionally equivalent to computing the direction of steepest descent under the spectral norm constraint:

U_t = \arg\max_{\|U\|_2 = 1} \operatorname{Tr}(G_t^\top U)

which leads to robust control over the effective step size and direction.
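A quick NumPy check of this variational characterization: the maximizer over the spectral-norm ball is the polar factor of $G_t$, and the attained value is the nuclear norm (sum of singular values) of $G_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(G, full_matrices=False)
polar = U @ Vt  # msign(G): all singular values equal 1

# Tr(G^T msign(G)) attains the nuclear norm of G, the maximum of
# Tr(G^T U) over the spectral-norm ball ||U||_2 <= 1.
assert np.isclose(np.trace(G.T @ polar), s.sum())
```

Any other feasible $U$ (spectral norm at most 1) gives a strictly smaller or equal trace, which is why the polar factor is the steepest-descent direction under the spectral norm.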

For square matrices, Muon-NSGD's step coincides with a Riemannian (natural) gradient on the Stiefel manifold under the canonical metric, and $\mathrm{msign}(G)$ acts as a geodesic retraction of the ambient gradient projected onto the tangent space (Mehta et al., 29 Sep 2025).

Muon-NSGD obtains $O(1/\sqrt{T})$ convergence in the stochastic nonconvex setting with adaptive learning rate schedules and remains provably stable in large-batch regimes. Notably, the Newton–Schulz approximation maintains the same convergence rate as the SVD-polar ideal, up to a constant factor vanishing doubly-exponentially fast in the number of steps and degree of polynomial approximation (Mehta et al., 29 Sep 2025, Kim et al., 27 Jan 2026, Li et al., 5 Feb 2025, Sato et al., 2 Jul 2025).

4. Relationship to the Spectral-Compression Family and Non-Euclidean Descent

Muon-NSGD is the $p=0$ endpoint of a broader spectral-transformation family $\Psi_p(O_t) = U \Sigma^p V^\top$ with $p \in [0,1]$, interpolating between standard updates ($p=1$, identity), partial compression ($p=\tfrac12, \tfrac14$), and full flattening ($p=0$, Muon). The $p=0$ transform outputs the "polar factor" $UV^\top$, which enforces unit singular values and thus flattens the spectrum completely.
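The family $\Psi_p$ can be sketched in a few lines of NumPy (the function name is illustrative; the sketch assumes a full-rank input so that $\sigma_i^0 = 1$ is well defined):

```python
import numpy as np

def spectral_power(O, p):
    """Psi_p(O) = U Sigma^p V^T: p=1 leaves O unchanged,
    p=0 returns the polar factor U V^T (the Muon endpoint).
    Assumes O has full rank (no zero singular values)."""
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    return (U * s**p) @ Vt  # scale the columns of U by sigma_i^p
```

Intermediate values such as $p=\tfrac12$ compress the spectrum toward 1 without discarding magnitude information entirely.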

Empirical findings indicate that Muon-NSGD ($p=0$, momentum input) dramatically stabilizes standard momentum SGD, widening its learning-rate stability range. However, the advantage relative to adaptive optimizers such as Adam is context-dependent: under second-moment (RMS) normalization, as in Adam, full spectral flattening may degrade performance, while partial spectral compression ($p \approx \tfrac12$) often achieves optimal trade-offs (Qi et al., 4 Feb 2026).

Within the non-Euclidean gradient descent and LMO (Linear Minimization Oracle) frameworks, Muon-NSGD's update corresponds to steepest descent under the spectral norm, and can be interpreted as constrained or regularized descent on a product norm that aggregates spectral and (for non-matrix blocks) adaptive norms (e.g., $\ell_\infty$, AdaGrad- or Adam-type norms) (Crawshaw et al., 10 Oct 2025, Gruntkowska et al., 1 Oct 2025).

5. Empirical Performance and Practical Implementation

Muon-NSGD achieves high data efficiency and compute optimality, especially with modern architectural advances. Empirically, Muon reaches target loss with 48–52% of the training computation of AdamW at matched or improved final perplexity (Mehta et al., 29 Sep 2025). When combined with MLA and MoE, configurations such as MLA+MoE+Muon achieve up to 68% memory reduction and 3.2× inference speedup alongside 8–12% perplexity improvements. In Transformer kernels, Muon scales efficiently from 30M to 200M parameter decoders.

In the context of "grokking" experiments, Muon-NSGD led to a statistically significant acceleration of generalization onset compared to AdamW, reducing the mean grokking epoch from 153.09 to 102.89 across modular arithmetic and related tasks (t = 5.0175, p = 6.33e-08) (Tveit et al., 22 Apr 2025).

The optimizer is sensitive to hyperparameter selection in its vanilla (constrained) forms—particularly learning rates—yet variants such as MuonMax (a regularized version mixing spectral and adaptive norms in an $\ell_2$ product) and Momo-regularized Muon substantially widen the stable regions and facilitate practical deployment (Crawshaw et al., 10 Oct 2025).

Summary of recommended hyperparameters for canonical Muon-NSGD (see (Mehta et al., 29 Sep 2025)):

  • Learning-rate decay: $\eta_t = \eta_0 / \sqrt{t}$ or cosine schedule.
  • Momentum: $\beta = 0.9$.
  • Newton–Schulz steps: $K=5$, $(a,b,c) = (3.4445, -4.7750, 2.0315)$.
  • RMS scaling: $\rho = 0.2$; match the RMS of AdamW's step to prevent early-stage instability.
  • Weight decay: $\lambda \in [0.05, 0.1]$.
  • Gradient clipping: global norm $\leq 1.0$.
  • Precision: mixed bfloat16 with float32 accumulators.
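The listed defaults can be collected into a configuration sketch; the dictionary keys and the base learning rate are hypothetical (the papers cited above do not fix a single value), and only the decay schedule and the tabulated constants come from the text:

```python
import math

# Illustrative defaults mirroring the list above; `lr0` is an assumption.
MUON_NSGD_DEFAULTS = dict(
    lr0=0.02,                              # base learning rate (assumed; tune per model)
    momentum=0.9,                          # beta
    ns_steps=5,                            # Newton–Schulz iterations K
    ns_coeffs=(3.4445, -4.7750, 2.0315),   # (a, b, c)
    rms_rho=0.2,                           # RMS-match scale s = rho * sqrt(n)
    weight_decay=0.05,                     # lambda in [0.05, 0.1]
    clip_norm=1.0,                         # global gradient-norm clip
)

def lr_at(t, lr0):
    """Inverse square-root decay eta_t = eta_0 / sqrt(t), for step t >= 1."""
    return lr0 / math.sqrt(t)
```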

6. Variants, Distributed Implementations, and Extensions

Several Muon-NSGD variants address optimizer robustness, distributed computation, and computational bottlenecks:

  • MuonMax: Replaces the strict spectral-ball update with a regularized step sensitive to aggregated spectral- and adaptive-norm magnitudes, exhibiting much greater learning-rate robustness while preserving strong generalization (Crawshaw et al., 10 Oct 2025).
  • Momo-regularized Muon: Incorporates model-based momentum and error-clipping using a “truncated model” framework for effective self-tuning of step sizes, leading to hyperparameter-insensitivity.
  • EF21-Muon: Provides the first convergence-guaranteed, communication-efficient distributed framework for non-Euclidean LMO-based optimizers, leveraging bidirectional error-feedback with compression. EF21-Muon recovers Muon-NSGD as a special case in the uncompressed limit and retains optimal convergence rates even under non-Euclidean smoothness structures and $(L^0, L^1)$-layerwise regularity (Gruntkowska et al., 1 Oct 2025).

Momentum Variance Reduction (MVR) techniques further accelerate Muon-NSGD, improving the convergence rate from $O(K^{-1/4})$ to $O(K^{-1/3})$ in the non-convex regime. When integrated in a block-wise fashion (Gluon-MVR), these approaches achieve improvements in iteration complexity and time-to-accuracy, as confirmed by controlled LLM pretraining experiments (Qian et al., 18 Dec 2025).
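MVR-style estimators generally build on a recursive (STORM-type) momentum that corrects the previous estimate with a gradient difference evaluated on the same mini-batch. A generic sketch, not the exact form used in (Qian et al., 18 Dec 2025):

```python
import numpy as np

def mvr_momentum(m_prev, grad_curr, grad_prev_params, a=0.1):
    """Generic STORM-type variance-reduced momentum:
        m_t = g(W_t) + (1 - a) * (m_{t-1} - g(W_{t-1}))
    where g(W_t) and g(W_{t-1}) are gradients on the SAME mini-batch,
    so the correction term cancels shared sampling noise."""
    return grad_curr + (1 - a) * (m_prev - grad_prev_params)
```

In a Muon-style optimizer this estimator would replace the plain momentum $M_t$ before the msign projection; the per-block (Gluon-MVR) variant applies it layer-wise.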

7. Theoretical and Practical Trade-offs

Muon-NSGD occupies a unique Pareto frontier in the optimizer landscape: it achieves spectral regularization without requiring expensive per-iteration SVD, scales efficiently with large parameter matrices, and delivers substantial stabilization benefits for sequence model training in regimes where standard SGD and Adam-type optimizers struggle with gradient anisotropy or explosion.

However, full spectral flattening can coarsen the update: it discards singular-value magnitude information, which may under-step in "stiff" directions or over-step in noisy directions. In extensive benchmarking, partial spectral compression (e.g., $p = 1/2$) is sometimes favored over full flattening ($p = 0$) in terms of final validation loss and stable learning-rate range, especially when second-moment normalization is already present (e.g., Adam, AdamS). A plausible implication is that spectral normalization principally addresses stabilization, while performance optimization may require milder spectrum flattening or hybrid norm designs (Qi et al., 4 Feb 2026).
