
Spectral Sphere Optimizer (SSO)

Updated 14 January 2026
  • Spectral Sphere Optimizer (SSO) is an optimization method that enforces strict spectral-norm constraints on both weights and updates, ensuring bounded activations under the Maximal Update Parametrization (μP).
  • It employs a constrained steepest-descent formulation that computes updates in the tangent space of the spectral sphere, resulting in rapid convergence and width-invariant learning dynamics.
  • Through efficient parallelization in platforms like Megatron, SSO outperforms traditional optimizers such as AdamW and Muon, offering improved stability and precise scaling in massive language models and MoE architectures.

The Spectral Sphere Optimizer (SSO) is an optimization method specifically designed for large-scale model training to achieve rapid convergence, rigorous stability, and strict scaling alignment under Maximal Update Parametrization (μP). SSO enforces simultaneous spectral-norm constraints on both weights and updates in each module, ensuring that model activations remain bounded and learning dynamics are width-invariant. Through a theoretically precise constrained steepest-descent formulation, SSO obtains updates that reside in the tangent space of the spectral sphere and applies an efficient parallel implementation suitable for massive LLMs and mixture-of-experts architectures (Xie et al., 13 Jan 2026).

1. Optimization Objective and Rationale

The optimization objective in SSO is to combine the fast convergence of steepest descent (in the spectral norm) with μP’s strict width-dependent activation control. Traditional optimizers such as AdamW allow weight drift, which leads to unbounded activation growth and degraded feature learning. Muon, by enforcing only the spectral norm of the update ($\|\Delta W\|_2$), remains only “half-aligned” with μP, as it allows the weight norm $\|W\|_2$ to drift. SSO resolves this by solving the exact constrained optimization problem at each step:

$$\max_{\Delta W} \langle \Delta W, \nabla_W L(W) \rangle \quad \text{subject to} \quad \|W\|_2 = R, \;\; \|\Delta W\|_2 = \eta R,$$

where $R = \Theta(\sqrt{d_\text{out}/d_\text{in}})$, ensuring that both weights and updates live on spectral spheres scaled in accordance with μP theory. This achieves a fully μP-aligned optimization process, yielding bounded activations and stable training dynamics (Xie et al., 13 Jan 2026).
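As a quick numerical illustration of this rationale (not from the paper), the operator-norm bound $\|Wx\|_2 \le \|W\|_2 \|x\|_2$ shows why pinning $\|W\|_2 = R$ with $R = \sqrt{d_\text{out}/d_\text{in}}$ keeps the output RMS bounded by the input RMS, independent of width:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 1024, 4096
R = np.sqrt(d_out / d_in)  # muP spectral radius, Theta(sqrt(d_out/d_in))

# Project a random weight matrix onto the spectral sphere ||W||_2 = R.
W = rng.standard_normal((d_out, d_in))
W *= R / np.linalg.norm(W, 2)

rms = lambda v: np.sqrt(np.mean(v ** 2))
x = rng.standard_normal(d_in)  # input with RMS ~ 1

# RMS(Wx) = ||Wx||_2 / sqrt(d_out) <= R * ||x||_2 / sqrt(d_out) = RMS(x),
# so the layer cannot amplify activation RMS regardless of width.
y = W @ x
```

The same bound holds at every width because $R$ rescales with the layer dimensions, which is the width-invariance the section describes.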

2. Spectral-Sphere Constraint Formulation

For any two-dimensional parameter matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$, the spectral radius is fixed: $\|W\|_2 = R$, with $R$ chosen as $\Theta(\sqrt{d_\text{out}/d_\text{in}})$ to ensure operator-norm stability under μP. The update is decomposed as $\Delta W = \eta R \varphi$ with $\|\varphi\|_2 = 1$, and the constrained step requires that both the weight after the step and the update reside precisely on their respective spectral spheres:

$$\|W + \Delta W\|_2 = R, \quad \|\Delta W\|_2 = \eta R.$$

In practice, a first-order (tangent-space) constraint is solved exactly, followed by a retraction step to ensure the post-update weight matrix precisely satisfies $\|W\|_2 = R$. This locking of the spectral norm after each update ensures activations remain strictly bounded (Xie et al., 13 Jan 2026).
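The retraction itself is a simple rescaling. A minimal NumPy sketch, using an exact spectral norm rather than the power iteration a production implementation would likely use:

```python
import numpy as np

def retract(W, R):
    """Rescale W so its spectral norm is exactly R (retraction onto the sphere)."""
    # np.linalg.norm(W, 2) is the largest singular value of W.
    return W * (R / np.linalg.norm(W, 2))
```

After any tangent-space step, applying `retract` restores $\|W\|_2 = R$ exactly, which is what locks the spectral norm between updates.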

3. Algorithmic Derivation and Update Mechanics

The constrained steepest-descent step is derived as follows. Let $G = \nabla_W L(W)$. Introducing $\varphi = \Delta W / (\eta R)$, the update is found by solving:

$$\max_{\varphi}\ \langle \varphi, G \rangle \quad \text{s.t.} \quad \|\varphi\|_2 = 1,\ \langle \varphi, W \rangle = 0,$$

where the second constraint ensures the update direction lies in the tangent space to the spectral sphere at $W$. The Lagrangian becomes $L(\varphi,\lambda) = \langle \varphi, G \rangle + \lambda \langle \varphi, W \rangle$. The maximizer, for fixed $\lambda$, is the matrix sign of $G + \lambda W$:

$$\varphi^*(\lambda) = \mathrm{msign}(G + \lambda W),$$

where, if $X = U\Sigma V^\top$ is the top-$r$ SVD of $X$, $\mathrm{msign}(X) = U_{:,1:r} V_{:,1:r}^\top$. The optimal $\lambda^*$ is chosen such that the tangent constraint $\langle \varphi, W \rangle = 0$ holds; this is solved numerically by bisection on

$$h(\lambda) = \langle W, \mathrm{msign}(G + \lambda W) \rangle = 0.$$

The update is then

$$\Delta W = -\eta R\, \varphi^*, \quad W \leftarrow W + \Delta W,$$

followed by a retraction:

$$W \leftarrow \frac{R}{\|W\|_2}\, W.$$

A compact expression, utilizing the SVD $W = U\Sigma V^\top$ and leading singular pair $(u_1, v_1)$, is the tangent projector $P = I - u_1 v_1^\top$, yielding

$$\Delta W = -\eta P G,$$

projecting $G$ onto the tangent space and stepping in that direction, up to the $\lambda$-shift (Xie et al., 13 Jan 2026).
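The derivation above can be sketched end to end in NumPy. This is an illustrative version (exact SVD for msign, naive bracketing for $\lambda^*$), not the paper’s optimized implementation:

```python
import numpy as np

def msign(X, eps=1e-12):
    # Matrix sign: U V^T from the thin SVD, i.e. all nonzero singular values set to 1.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > eps * s.max()).sum())  # numerical rank
    return U[:, :r] @ Vt[:r]

def sso_step(W, G, eta, R, tol=2e-4, max_iter=20):
    """One SSO step: tangent-space steepest descent on the spectral sphere + retraction."""
    h = lambda lam: np.vdot(W, msign(G + lam * W))
    # Bracket the root: h(lam) -> +/- (sum of singular values of W) as lam -> +/- inf.
    lo, hi = -1.0, 1.0
    while h(lo) > 0:
        lo *= 2.0
    while h(hi) < 0:
        hi *= 2.0
    # Bisect until the tangency constraint <W, phi> = 0 holds to tolerance.
    lam = 0.0
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        val = h(lam)
        if abs(val) < tol:
            break
        if val > 0:
            hi = lam
        else:
            lo = lam
    phi = msign(G + lam * W)
    W_new = W - eta * R * phi                       # Delta W = -eta * R * phi*
    return W_new * (R / np.linalg.norm(W_new, 2))   # retraction back onto ||W||_2 = R
```

After the step, the returned matrix sits exactly on the sphere $\|W\|_2 = R$ again, so the constraint is re-established every iteration.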

4. μP Scaling Alignment and Theoretical Guarantees

The optimizer achieves strict μP alignment by enforcing spectral-norm constraints on both weights and updates:

$$\|W\|_2 = \Theta\!\left(\sqrt{\frac{d_\text{out}}{d_\text{in}}}\right), \quad \|\Delta W\|_2 = \Theta\!\left(\sqrt{\frac{d_\text{out}}{d_\text{in}}}\right).$$

This provides width-invariant scaling, so that the learning rate may be kept constant across model sizes without risk of activation explosion. In contrast, Muon enforces only $\|\Delta W\|_2$ and not $\|W\|_2$, permitting hidden-activation drift (“half-aligned”), whereas SSO preserves both exactly. The result is strictly bounded activations and improved stability, especially relevant in large-scale LLM and MoE settings (Xie et al., 13 Jan 2026).

5. Parallelized Implementation in Megatron

SSO is implemented in the Megatron-GPT codebase using several engineering strategies for efficiency and scalability:

  • Atomic Module Sharding: Each logical sub-matrix (such as attention Q, K, or V projections and SwiGLU gate/up matrices) is separately constrained and updated, avoiding over- or under-constraint. Sharding is performed at the atomic module level across data-parallel (DP) ranks.
  • Load Balancing: because the spectral solver’s cost varies per module (bracketing and bisecting for $\lambda^*$), modules are assigned to DP ranks in a zigzag order that interleaves large and small workloads.
  • Synchronization and Communication: After the spectral update, parameters are synchronized via iterative All-Gather on the atomic shards.
  • Kernel and Precision Optimizations: adaptive msign kernels, with JIT torch.addmm for small matrices ($<512\times512$) and Triton SYRK-optimized Newton–Schulz for large ones. Multi-streaming hides kernel-launch latency on small independent modules. Power iteration for $\|W\|_2$ runs in BF16 and msign in FP32 with 8 iterations, with $u_1, v_1$ cached for rapid convergence (Xie et al., 13 Jan 2026).
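As a reference point for the Newton–Schulz kernel mentioned above, the classic cubic iteration (standard $1.5/{-0.5}$ coefficients; the production Triton kernel may use tuned variants) approximates msign without an explicit SVD:

```python
import numpy as np

def msign_newton_schulz(X, iters=8):
    # Approximate msign(X) = U V^T via the cubic Newton-Schulz iteration.
    # Frobenius normalization bounds all singular values in (0, 1], inside the
    # iteration's convergence region (0, sqrt(3)); each step drives them toward 1.
    A = X / np.linalg.norm(X)  # Frobenius norm
    for _ in range(iters):
        A = 1.5 * A - 0.5 * (A @ A.T @ A)
    return A
```

Each iteration maps every singular value $\sigma \mapsto 1.5\sigma - 0.5\sigma^3$, a polynomial with fixed point 1, so the result converges to the orthogonal factor $UV^\top$ using only matrix multiplies, which is why it suits GPU execution.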

6. Empirical Evaluation and Performance Analysis

Comprehensive empirical results demonstrate SSO’s performance advantages:

| Model / Setting | AdamW | Muon | SSO |
|---|---|---|---|
| Dense 1.7B (100B tokens) | 23K steps (loss 2.588), 54.75% acc | 20.3K steps (loss 2.588), 55.26% acc | 18.7K steps (loss 2.588), 56.35% acc |
| MoE 8B-A1B (router max-violation) | ≈0.20, with spikes | ≈0.10 | ≈0.02 (consistent) |
| DeepNet 200-layer | Spiky, slow | More stable, heavy tail | Smoothest, lowest loss |

SSO accelerates convergence (e.g., 19% fewer steps than AdamW to the target loss on Dense 1.7B), produces higher downstream accuracy, yields tightly controlled router load variance in MoEs, and achieves a smoothly descending training loss in deep architectures. In Dense 1.7B, attention AbsMax and FFN RMS remain at $\Theta(1)$ under SSO, while AdamW shows activation growth of ${\sim}100\times$ (Xie et al., 13 Jan 2026).

7. Ablation Studies and Hyperparameter Strategies

Ablation studies reveal optimal choices and practical considerations:

  • Spectral Radius Scaling: $R = c\sqrt{d_\text{out}/d_\text{in}}$, with $c \approx 2.0$ yielding the best loss. AbsMax scales linearly with $c$; RMS scales as $c^\alpha$ with $\alpha \approx 0.5$.
  • Learning Rate Scaling: the spectral μP scaler $R = \sqrt{d_\text{out}/d_\text{in}}$ outperforms both Align-Adam-RMS and Spectral-Kaiming.
  • Module Granularity: split attention QKV per head for maximal gain; for SwiGLU, keep the gate/up matrices separate by default.
  • Solver Tolerance and Iterations: $\lambda$-root tolerance $\epsilon \approx 2\times10^{-4}$, at most 20 iterations; bracketing takes 1–3 steps, bisection 5–7 steps. msign uses 8 Newton–Schulz iterations in FP32 (5 in BF16, with negligible precision loss).
  • Momentum and Weight Decay: Nesterov momentum $\beta \approx 0.9$; no explicit weight decay for hidden 2D weights (the retraction enforces the norm). 1D parameters may optionally use a small decay; results are mixed.
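The prescriptions above can be collected into a defaults dictionary. The key names here are illustrative, not taken from the paper’s code:

```python
# Hypothetical defaults collecting the ablation prescriptions above;
# key names are illustrative, not from the paper's implementation.
SSO_DEFAULTS = {
    "spectral_radius_coeff": 2.0,  # R = c * sqrt(d_out / d_in), c ~ 2.0
    "lambda_root_tol": 2e-4,       # bisection tolerance for lambda*
    "lambda_max_iters": 20,
    "msign_iters_fp32": 8,         # Newton-Schulz iterations in FP32
    "msign_iters_bf16": 5,         # fewer iterations suffice in BF16
    "nesterov_momentum": 0.9,
    "weight_decay_2d": 0.0,        # retraction replaces explicit decay
}
```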

These studies support the prescription that stability and transferability are optimized by enforcing spectral constraints at the most granular and theoretically justified level (Xie et al., 13 Jan 2026).


SSO is thus the uniquely defined optimizer that solves the exact constrained steepest-descent step on the spectral sphere. It realizes strict μP scaling on both weights and updates, is practical at large scale via efficient parallelization, and matches or outperforms established optimizers such as AdamW and Muon while guaranteeing the strict activation and routing stability required for robust LLM and MoE training (Xie et al., 13 Jan 2026).
