
Spectral Sphere Optimizer (SSO)

Updated 14 January 2026
  • Spectral Sphere Optimizer (SSO) is an optimization method that enforces strict spectral-norm constraints on both weights and updates, ensuring bounded activations under the Maximal Update Parametrization (μP).
  • It employs a constrained steepest-descent formulation that computes updates in the tangent space of the spectral sphere, resulting in rapid convergence and width-invariant learning dynamics.
  • Through efficient parallelization in platforms like Megatron, SSO outperforms traditional optimizers such as AdamW and Muon, offering improved stability and precise scaling in massive language models and MoE architectures.

The Spectral Sphere Optimizer (SSO) is an optimization method specifically designed for large-scale model training to achieve rapid convergence, rigorous stability, and strict scaling alignment under Maximal Update Parametrization (μP). SSO enforces simultaneous spectral-norm constraints on both weights and updates in each module, ensuring that model activations remain bounded and learning dynamics are width-invariant. Through a theoretically precise constrained steepest-descent formulation, SSO obtains updates that reside in the tangent space of the spectral sphere and applies an efficient parallel implementation suitable for massive LLMs and mixture-of-experts architectures (Xie et al., 13 Jan 2026).

1. Optimization Objective and Rationale

The optimization objective in SSO is to combine the fast convergence of steepest descent (in the spectral norm) with μP’s strict width-dependent activation control. Traditional optimizers such as AdamW allow weight drift, which leads to unbounded activation growth and degraded feature learning. Muon, by enforcing only the spectral norm of the update ($\|\Delta W\|_2$), remains only “half-aligned” with μP, as it allows the weight norm $\|W\|_2$ to drift. SSO resolves this by solving the exact constrained optimization problem at each step:

$$\max_{\Delta W} \langle \Delta W, \nabla_W L(W) \rangle \quad \text{subject to} \quad \|W\|_2 = R, \;\; \|\Delta W\|_2 = \eta R,$$

where $R = \Theta(\sqrt{d_\text{out}/d_\text{in}})$, ensuring that both weights and updates live on spectral spheres scaled in accordance with μP theory. This achieves a fully μP-aligned optimization process, yielding bounded activations and stable training dynamics (Xie et al., 13 Jan 2026).
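As a quick numerical illustration of this rationale (not from the paper), the operator-norm bound $\|Wx\|_2 \le \|W\|_2 \|x\|_2$ shows why pinning $\|W\|_2 = R$ with $R = \sqrt{d_\text{out}/d_\text{in}}$ keeps the output RMS bounded by the input RMS, independent of width:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 1024, 4096
R = np.sqrt(d_out / d_in)  # muP spectral radius, Theta(sqrt(d_out/d_in))

# Project a random weight matrix onto the spectral sphere ||W||_2 = R.
W = rng.standard_normal((d_out, d_in))
W *= R / np.linalg.norm(W, 2)

rms = lambda v: np.sqrt(np.mean(v ** 2))
x = rng.standard_normal(d_in)  # input with RMS ~ 1

# RMS(Wx) = ||Wx||_2 / sqrt(d_out) <= R * ||x||_2 / sqrt(d_out) = RMS(x),
# so the layer cannot amplify activation RMS regardless of width.
y = W @ x
```

The same bound holds at every width because $R$ rescales with the layer dimensions, which is the width-invariance the section describes.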

2. Spectral-Sphere Constraint Formulation

For any two-dimensional parameter matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$, the spectral radius is fixed: $\|W\|_2 = R$, with $R$ chosen as $\Theta(\sqrt{d_\text{out}/d_\text{in}})$ to ensure operator-norm stability under μP. The update is decomposed as $\Delta W = \eta R \varphi$ with $\|\varphi\|_2 = 1$, and the constrained step requires that both the weight after the step and the update reside precisely on their respective spectral spheres:

$$\|W + \Delta W\|_2 = R, \quad \|\Delta W\|_2 = \eta R.$$

In practice, a first-order (tangent-space) constraint is solved exactly, followed by a retraction step to ensure the post-update weight matrix precisely satisfies $\|W\|_2 = R$. This locking of the spectral norm after each update ensures activations remain strictly bounded (Xie et al., 13 Jan 2026).
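The retraction itself is a simple rescaling. A minimal NumPy sketch, using an exact spectral norm rather than the power iteration a production implementation would likely use:

```python
import numpy as np

def retract(W, R):
    """Rescale W so its spectral norm is exactly R (retraction onto the sphere)."""
    # np.linalg.norm(W, 2) is the largest singular value of W.
    return W * (R / np.linalg.norm(W, 2))
```

After any tangent-space step, applying `retract` restores $\|W\|_2 = R$ exactly, which is what locks the spectral norm between updates.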

3. Algorithmic Derivation and Update Mechanics

The constrained steepest-descent step is derived as follows. Let $G = \nabla_W L(W)$. Introducing $\varphi = \Delta W / (\eta R)$, the update is found by solving:

$$\max_{\varphi}\ \langle \varphi, G \rangle \quad \text{s.t.} \quad \|\varphi\|_2 = 1,\ \langle \varphi, W \rangle = 0,$$

where the second constraint ensures the update direction lies in the tangent space to the spectral sphere at $W$. The Lagrangian becomes $L(\varphi,\lambda) = \langle \varphi, G \rangle + \lambda \langle \varphi, W \rangle$. The maximizer, for fixed $\lambda$, is the matrix sign of $G + \lambda W$:

$$\varphi^*(\lambda) = \mathrm{msign}(G + \lambda W),$$

where, if $X = U\Sigma V^\top$ is the top-$r$ SVD of $X$, $\mathrm{msign}(X) = U_{:,1:r} V_{:,1:r}^\top$. The optimal $\lambda^*$ is chosen such that the tangent constraint $\langle \varphi, W \rangle = 0$ holds; this is solved numerically by bisection on

$$h(\lambda) = \langle W, \mathrm{msign}(G + \lambda W) \rangle = 0.$$

The update is then

$$\Delta W = -\eta R\, \varphi^*, \quad W \leftarrow W + \Delta W,$$

followed by a retraction:

$$W \leftarrow \frac{R}{\|W\|_2}\, W.$$

A compact expression, utilizing the SVD $W = U\Sigma V^\top$ and leading singular pair $(u_1, v_1)$, is the tangent projector $P = I - u_1 v_1^\top$, yielding

$$\Delta W = -\eta P G,$$

projecting $G$ onto the tangent space and stepping in that direction, up to the $\lambda$-shift (Xie et al., 13 Jan 2026).
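The derivation above can be sketched end to end in NumPy. This is an illustrative version (exact SVD for msign, naive bracketing for $\lambda^*$), not the paper’s optimized implementation:

```python
import numpy as np

def msign(X, eps=1e-12):
    # Matrix sign: U V^T from the thin SVD, i.e. all nonzero singular values set to 1.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > eps * s.max()).sum())  # numerical rank
    return U[:, :r] @ Vt[:r]

def sso_step(W, G, eta, R, tol=2e-4, max_iter=20):
    """One SSO step: tangent-space steepest descent on the spectral sphere + retraction."""
    h = lambda lam: np.vdot(W, msign(G + lam * W))
    # Bracket the root: h(lam) -> +/- (sum of singular values of W) as lam -> +/- inf.
    lo, hi = -1.0, 1.0
    while h(lo) > 0:
        lo *= 2.0
    while h(hi) < 0:
        hi *= 2.0
    # Bisect until the tangency constraint <W, phi> = 0 holds to tolerance.
    lam = 0.0
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        val = h(lam)
        if abs(val) < tol:
            break
        if val > 0:
            hi = lam
        else:
            lo = lam
    phi = msign(G + lam * W)
    W_new = W - eta * R * phi                       # Delta W = -eta * R * phi*
    return W_new * (R / np.linalg.norm(W_new, 2))   # retraction back onto ||W||_2 = R
```

After the step, the returned matrix sits exactly on the sphere $\|W\|_2 = R$ again, so the constraint is re-established every iteration.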

4. μP Scaling Alignment and Theoretical Guarantees

The optimizer achieves strict μP alignment by enforcing spectral-norm constraints on both weights and updates:

$$\|W\|_2 = \Theta\!\left(\sqrt{\frac{d_\text{out}}{d_\text{in}}}\right), \quad \|\Delta W\|_2 = \Theta\!\left(\sqrt{\frac{d_\text{out}}{d_\text{in}}}\right).$$

This provides width-invariant scaling, so that the learning rate may be kept constant across model sizes without risk of activation explosion. In contrast, Muon enforces only $\|\Delta W\|_2$ and not $\|W\|_2$, permitting hidden-activation drift (“half-aligned”), whereas SSO preserves both exactly. The result is strictly bounded activations and improved stability, especially relevant in large-scale LLM and MoE settings (Xie et al., 13 Jan 2026).

5. Parallelized Implementation in Megatron

SSO is implemented in the Megatron-GPT codebase using several engineering strategies for efficiency and scalability:

  • Atomic Module Sharding: Each logical sub-matrix (such as attention Q, K, or V projections and SwiGLU gate/up matrices) is separately constrained and updated, avoiding over- or under-constraint. Sharding is performed at the atomic module level across data-parallel (DP) ranks.
  • Load Balancing: because the spectral solver’s cost varies per module (bracketing and bisecting for $\lambda^*$), modules are assigned to DP ranks in a zigzag order that interleaves large and small workloads.
  • Synchronization and Communication: After the spectral update, parameters are synchronized via iterative All-Gather on the atomic shards.
  • Kernel and Precision Optimizations: adaptive msign kernels, with JIT torch.addmm for small matrices ($<512\times512$) and Triton SYRK-optimized Newton–Schulz for large ones. Multi-streaming hides kernel-launch latency on small independent modules. Power iteration for $\|W\|_2$ runs in BF16 and msign in FP32 with 8 iterations, with $u_1, v_1$ cached for rapid convergence (Xie et al., 13 Jan 2026).
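As a reference point for the Newton–Schulz kernel mentioned above, the classic cubic iteration (standard $1.5/{-0.5}$ coefficients; the production Triton kernel may use tuned variants) approximates msign without an explicit SVD:

```python
import numpy as np

def msign_newton_schulz(X, iters=8):
    # Approximate msign(X) = U V^T via the cubic Newton-Schulz iteration.
    # Frobenius normalization bounds all singular values in (0, 1], inside the
    # iteration's convergence region (0, sqrt(3)); each step drives them toward 1.
    A = X / np.linalg.norm(X)  # Frobenius norm
    for _ in range(iters):
        A = 1.5 * A - 0.5 * (A @ A.T @ A)
    return A
```

Each iteration maps every singular value $\sigma \mapsto 1.5\sigma - 0.5\sigma^3$, a polynomial with fixed point 1, so the result converges to the orthogonal factor $UV^\top$ using only matrix multiplies, which is why it suits GPU execution.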

6. Empirical Evaluation and Performance Analysis

Comprehensive empirical results demonstrate SSO’s performance advantages:

| Model / Setting | AdamW | Muon | SSO |
|---|---|---|---|
| Dense 1.7B (100B tokens) | 23K steps (loss 2.588), 54.75% acc | 20.3K steps (loss 2.588), 55.26% acc | 18.7K steps (loss 2.588), 56.35% acc |
| MoE 8B-A1B (router max-violation) | ≈0.20, with spikes | ≈0.10 | ≈0.02 (consistent) |
| DeepNet 200-layer | Spiky, slow | More stable, heavy tail | Smoothest, lowest loss |

SSO accelerates convergence (e.g., 19% fewer steps than AdamW to the target loss on Dense 1.7B), produces higher downstream accuracy, yields tightly controlled router load variance in MoEs, and achieves a smoothly descending training loss in deep architectures. In Dense 1.7B, attention AbsMax and FFN RMS remain at $\Theta(1)$ under SSO, while AdamW shows activation growth of ${\sim}100\times$ (Xie et al., 13 Jan 2026).

7. Ablation Studies and Hyperparameter Strategies

Ablation studies reveal optimal choices and practical considerations:

  • Spectral Radius Scaling: $R = c\sqrt{d_\text{out}/d_\text{in}}$, with $c \approx 2.0$ yielding the best loss. AbsMax scales linearly with $c$; RMS scales as $c^\alpha$ with $\alpha \approx 0.5$.
  • Learning Rate Scaling: the spectral μP scaler $R = \sqrt{d_\text{out}/d_\text{in}}$ outperforms both Align-Adam-RMS and Spectral-Kaiming.
  • Module Granularity: split attention QKV per head for maximal gain; for SwiGLU, keep the gate/up matrices separate by default.
  • Solver Tolerance and Iterations: $\lambda$-root tolerance $\epsilon \approx 2\times10^{-4}$, at most 20 iterations; bracketing takes 1–3 steps, bisection 5–7 steps. msign uses 8 Newton–Schulz iterations in FP32 (5 in BF16, with negligible precision loss).
  • Momentum and Weight Decay: Nesterov momentum $\beta \approx 0.9$; no explicit weight decay for hidden 2D weights (the retraction enforces the norm). 1D parameters may optionally use a small decay; results are mixed.
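The prescriptions above can be collected into a defaults dictionary. The key names here are illustrative, not taken from the paper’s code:

```python
# Hypothetical defaults collecting the ablation prescriptions above;
# key names are illustrative, not from the paper's implementation.
SSO_DEFAULTS = {
    "spectral_radius_coeff": 2.0,  # R = c * sqrt(d_out / d_in), c ~ 2.0
    "lambda_root_tol": 2e-4,       # bisection tolerance for lambda*
    "lambda_max_iters": 20,
    "msign_iters_fp32": 8,         # Newton-Schulz iterations in FP32
    "msign_iters_bf16": 5,         # fewer iterations suffice in BF16
    "nesterov_momentum": 0.9,
    "weight_decay_2d": 0.0,        # retraction replaces explicit decay
}
```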

These studies support the prescription that stability and transferability are optimized by enforcing spectral constraints at the most granular and theoretically justified level (Xie et al., 13 Jan 2026).


SSO is thus the uniquely defined optimizer that solves the exact constrained steepest-descent step on the spectral sphere. It realizes strict μP scaling on both weights and updates, is practical at large scale via efficient parallelization, and matches or outperforms established optimizers such as AdamW and Muon while guaranteeing the strict activation and routing stability required for robust LLM and MoE training (Xie et al., 13 Jan 2026).
