
Delving into Muon and Beyond: Deep Analysis and Extensions

Published 4 Feb 2026 in cs.LG, cs.AI, and cs.CL | (2602.04669v1)

Abstract: The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix-shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form U Σ^p V^⊤, and consider additional variants with p = 1/2, p = 1/4, and p = 1. These transformations are applied both to first-moment updates, as in momentum SGD, and to root-mean-square (RMS) normalized gradient updates, as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS-normalized updates yield more stable optimization than first-moment updates. Moreover, while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at https://github.com/Ocram7/BeyondMuon.

Summary

  • The paper presents a unified spectral framework interpolating between standard optimizers and Muon’s orthogonalized updates.
  • It employs coupled Newton–Schulz iterations for efficient inverse matrix computations and verifies findings using controlled experiments on nanoGPT.
  • Results show that while Muon stabilizes momentum updates, excessive spectral flattening might reduce performance in RMS-normalized regimes.

Spectral Perspectives in Optimizer Design: A Deep Analysis of Muon and Its Extensions

Introduction

This work presents a comprehensive investigation into the Muon optimizer and its spectral generalizations for matrix-shaped parameters in neural networks. The authors introduce a unified framework for analyzing Muon, Adam, and intermediate variants, contextualizing orthogonalization and spectral compression within a rigorous, controlled experimental setting. The analysis aims to clarify Muon's mechanisms, its relationship to adaptive optimizers, and the practical/algorithmic consequences of spectral parametrization.

Unified Spectral Framework for Optimization

The central contribution is a parametric family of spectral transformations applied to update matrices, denoted Ψ_p(O_t) = U Σ^p V^⊤ for p ∈ [0, 1]. This family interpolates between standard optimizers (p = 1; e.g., Adam, mSGD), fractional spectral compression (p = 1/2, p = 1/4), and the Muon/polar update (p = 0, i.e., orthogonalization).
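As a concrete illustration, the transform follows directly from an SVD of the update matrix; a minimal NumPy sketch (the function name `psi_p` is ours, not from the paper):

```python
import numpy as np

def psi_p(O, p):
    """Spectral transform Psi_p(O) = U diag(sigma^p) V^T (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    return U @ np.diag(s ** p) @ Vt

rng = np.random.default_rng(0)
O = rng.standard_normal((4, 3))

identity_case = psi_p(O, 1.0)   # p = 1 leaves the update unchanged
polar_case = psi_p(O, 0.0)      # p = 0 flattens all singular values to 1
```

At p = 0 the result has orthonormal columns (polar_case.T @ polar_case ≈ I), which is exactly the orthogonalization/polar update Muon performs.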

The paper delineates two input regimes for these spectral updates:

  1. First-moment update (O_t^mom = M_t): momentum SGD-like updates.
  2. RMS-normalized update (O_t^rms = M_t ⊘ √V_t): Adam-like updates that incorporate second-moment statistics.

These inputs are paired with the four spectral exponents to produce eight configurations, enabling a systematic comparison among Muon, Adam, and novel interpolations.
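The two input regimes can be sketched as follows (hyperparameter values are illustrative defaults, not the paper's settings):

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8

def first_moment_input(M, G):
    """O_t^mom: exponential moving average of gradients (momentum)."""
    return beta1 * M + (1 - beta1) * G

def rms_normalized_input(M, V, G):
    """O_t^rms: momentum divided elementwise by the sqrt of the second-moment estimate."""
    M = beta1 * M + (1 - beta1) * G
    V = beta2 * V + (1 - beta2) * G ** 2
    return M, V, M / (np.sqrt(V) + eps)

# One step on a random gradient matrix
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
O_mom = first_moment_input(np.zeros_like(G), G)
M, V, O_rms = rms_normalized_input(np.zeros_like(G), np.zeros_like(G), G)
```

Either output is then passed through Ψ_p with one of the four exponents, giving the eight configurations.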

Efficient Spectral Computation

A core challenge is the computational cost of directly performing singular value decompositions for large-scale matrices. To address this, the authors implement coupled Newton-Schulz iteration algorithms, enabling efficient calculation of inverse fractional matrix roots with only matrix multiplications:

  • For Ψ_{1/2} and Ψ_{1/4}, the inverse square root and inverse fourth root, respectively, are computed on the symmetric Gram matrix O_t^⊤ O_t, yielding scalable spectral compression applicable within deep architectures.
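A schematic version of such a coupled iteration for a symmetric positive-definite Gram matrix is below; the Frobenius scaling and fixed iteration count are illustrative choices, and production code would add convergence checks:

```python
import numpy as np

def coupled_newton_schulz(X, iters=30):
    """Coupled Newton-Schulz sketch: approximate X^{1/2} and X^{-1/2} for SPD X
    using only matrix multiplications (no explicit SVD or eigendecomposition)."""
    n = X.shape[0]
    c = np.linalg.norm(X)            # Frobenius normalization keeps eigenvalues in (0, 1]
    Y = X / c                        # Y_k -> (X/c)^{1/2}
    Z = np.eye(n)                    # Z_k -> (X/c)^{-1/2}
    I = np.eye(n)
    for _ in range(iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return np.sqrt(c) * Y, Z / np.sqrt(c)   # rescale back: X^{1/2}, X^{-1/2}

# Sanity check on a well-conditioned random SPD Gram matrix
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
X = G.T @ G + 0.5 * np.eye(5)
R, Rinv = coupled_newton_schulz(X)
```

An inverse fourth root (for Ψ_{1/4}) can be obtained by applying the same iteration a second time to X^{-1/2}, since that matrix is itself SPD.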

Controlled Experimental Protocol

All methods are evaluated under controlled conditions using nanoGPT (124M parameters): learning rates are decoupled for matrix-shaped and vector-shaped parameters, weight decay is disabled, and auxiliary stabilization (e.g., QK-Norm, QK-Clip) is omitted. This design ensures that observed differences are attributable strictly to variations in the update rule.

Empirical Results: Spectral Compression Dynamics

The experiments yield the following primary findings:

  1. Muon stabilizes momentum updates: On first-moment inputs, Muon (p = 0) is markedly more stable than mSGD across a broad learning-rate range, but it does not universally achieve the highest final performance. Partial spectral compression (p = 1/4) can occasionally yield superior results at optimal learning rates.

Figure 1: mSGDZ/Muon (UV^⊤) shows improved stability versus mSGD across learning rates on first-moment updates.

  2. RMS-normalized inputs are inherently more robust: Applying spectral compression to RMS-normalized updates (Adam's paradigm) produces limited gains; full orthogonalization (AdamZ) often underperforms Adam and the fractional variants (AdamS, AdamQ).

Figure 2: AdamZ (UV^⊤) demonstrates that Muon-style orthogonalization applied to RMS-normalized updates does not outperform Adam or fractional spectral variants.

  3. Spectral exponent regime influences anisotropy: Lowering p compresses the singular spectrum, reducing the condition number κ_p = κ(O_t)^p; however, full flattening (p = 0) may amplify noise and suboptimally reweight gradient directions.

    Figure 3: Comparison of the four spectral exponents p ∈ {0, 1/4, 1/2, 1} elucidates spectrum flattening and its effect on learning dynamics for various input types.
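The condition-number compression κ_p = κ(O_t)^p is easy to verify numerically; a toy example with κ(O_t) = 100:

```python
import numpy as np

# Singular values with condition number kappa = 100; Psi_p maps sigma -> sigma^p,
# so the condition number of the transformed update becomes kappa^p.
sigma = np.array([10.0, 1.0, 0.1])
kappas = {p: (sigma ** p).max() / (sigma ** p).min() for p in [1.0, 0.5, 0.25, 0.0]}
# kappa^p: 100 at p = 1, 10 at p = 1/2, ~3.16 at p = 1/4, 1 at p = 0
```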

Theoretical and Practical Implications

Robustness vs. Selectivity

Spectral orthogonalization is effective at mitigating the instability caused by highly anisotropic update spectra in momentum methods. However, Muon discards all magnitude information, treating dominant and subdominant directions identically. This can be detrimental in later-stage training, where the dominant (large-singular-value) directions may carry more signal than the small-singular-value directions, which typically contain noise.

RMS Normalization and Modern Transformer Training

Adam-style second-moment normalization already controls per-coordinate gradient scaling, bounding update norms irrespective of raw gradient magnitude. Consequently, further spectral flattening yields only marginal stability improvements and, in some cases, degrades performance by diminishing selective directionality.
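This bounding effect shows up even in a toy scalar example: with β1 = 0.9 and β2 = 0.999, the per-coordinate update |m_t/√v_t| stays on the order of (1-β1)/√(1-β2) ≈ 3.16 no matter how large the raw gradients become (a sketch, not the paper's experiment):

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0
updates = []
for g in [1e-6, 1e3, -1e3, 1e6]:             # wildly varying gradient scales
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    updates.append(m / (np.sqrt(v) + eps))
# every |update| is O(1) despite 12 orders of magnitude of variation in |g|
```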

Computational Overhead

The coupled Newton-Schulz framework enables practical matrix root operations, making the family of spectral optimizers computationally attractive. Nonetheless, for extremely large models, further efficiency innovations may be required.

Future Directions

The paper identifies open areas for exploration:

  • Regularization synergy: Integration of matrix-based optimizers with weight decay and other regularization protocols remains unaddressed.
  • Higher-order/structured spectral adaptation: There is room for scheme development that selectively compresses spectrum rather than uniformly flattening, to balance robustness and selective anisotropy exploitation.
  • Theoretical optimization landscape analysis: Detailed characterization of landscape navigation under varying spectral exponents and normalization regimes can inform both algorithmic and architectural choices in future deep networks.

Conclusion

This work systematizes spectral transformation methodologies in optimizer design, providing a thorough experimental and theoretical basis for understanding Muon's strengths and limitations. The analysis demonstrates that, in the absence of RMS normalization, Muon stabilizes optimization but does not universally surpass adaptive optimizers such as Adam. Fully flattened spectral updates are not inherently superior to partial compression or classical adaptive methods. Future improvements will likely arise from targeted spectrum regularization strategies and integrating matrix-based normalization with established regularization techniques.


Explain it Like I'm 14

A simple explanation of “Delving into Muon and Beyond: Deep Analysis and Extensions”

Overview

This paper looks at two ways to train big neural networks faster and more reliably: the well-known Adam optimizer and a newer method called Muon. The authors try to understand how Muon really works, how it compares to Adam, and whether changing Muon’s main idea can make training better. They do this by studying the “shape” of gradient updates using a common lens (called a spectral perspective) and running careful, controlled experiments.

What questions did the paper ask?

The paper asks a few straightforward questions:

  • What is Muon doing under the hood, and how does it relate to Adam?
  • If we take Muon’s core idea (reshaping the update directions) and make it stronger or weaker, what happens?
  • Is Muon actually better than Adam, or is it useful only in specific situations?
  • Can we compute Muon-like updates efficiently on large models?

How did they study it?

To keep things clear, here are the main ideas they used:

  • Adam vs. Muon (basic idea in everyday language):
    • Adam: Think of Adam as a smart “cruise control” for training. It looks at how big the gradients are on each parameter and automatically turns the speed up or down to keep training steady. It uses two running averages: the first moment (an average of recent gradients) and the second moment (an average of squared gradients that measures typical size/variability). This second-moment “RMS-normalization” keeps updates from exploding.
    • Muon: Instead of adjusting each number separately, Muon looks at whole matrices of gradients and treats them like combinations of directions with different strengths (this comes from something like a “direction-strength” breakdown called SVD). Muon’s core move is to flatten the strengths so all directions have the same size. Think of it like an audio compressor that turns down loud parts and turns up quiet parts so everything is more even.
  • A family of “spectral” transformations:
    • The authors build a family of updates that smoothly moves from “do nothing” to “flatten everything,” controlled by a knob called p:
    • p = 1: keep the original update (like standard Adam or momentum SGD).
    • p = 1/2 or p = 1/4: partially compress the strong directions and lift the weak ones.
    • p = 0: fully flatten everything (this is Muon’s signature move).
    • This lets them test whether “a little flattening” is better than “flatten everything.”
  • Two places to apply these transformations:
    • Momentum-only updates (first moment): like using the average of recent gradients without scale control.
    • RMS-normalized updates (second moment, Adam-style): like using Adam’s “cruise control,” which already keeps sizes bounded and stable.
  • Efficient computation trick:
    • Doing these matrix operations directly can be slow. The authors use a method called coupled Newton–Schulz iteration, which is a fast, repeated “guess-and-improve” technique that avoids the heavy-duty SVD step. In simple terms: it’s a way to estimate certain matrix “square roots” using only multiplications, making the method practical on big models.
  • Controlled experiments:
    • They train a 124M-parameter GPT-2-like model (nanoGPT) on OpenWebText for 200k steps.
    • To be fair, they carefully separate learning rates for matrix parameters and vector parameters, turn off weight decay, and avoid extra tricks (like QK-Norm or QK-Clip) so they can isolate what the optimizer itself is doing.
    • They try different learning rates in a coarse-to-fine search to find stable and strong settings for each optimizer.

What did they find, and why is it important?

Here are the main results the authors saw:

  • Adam-style (RMS-normalized) updates are more stable than momentum-only updates.
    • When you already have Adam’s “cruise control,” training tends to be smoother and more predictable.
  • Muon (full flattening, p = 0) helps stabilize momentum-only updates but doesn’t beat Adam overall.
    • If you don’t normalize the size of updates (just use momentum), Muon’s flattening step can stop training from blowing up. However, when you compare Muon to Adam (which already normalizes), Muon doesn’t consistently win.
  • Partial flattening (p = 1/2 or 1/4) can sometimes be better than full flattening, especially with momentum-only updates.
    • Making strong directions “less strong” helps stability, but flattening them all the way can also suppress helpful signals and boost noise. A gentler touch (like p = 1/4 or p = 1/2) can strike a better balance.
  • Spectral transforms on Adam-style updates give modest gains at best.
    • Since Adam already keeps scales under control, extra “flattening” rarely helps and full flattening (p = 0) can even hurt by amplifying noisy directions.

Why this matters:

  • It clarifies what Muon is best at: stabilizing updates when you don’t already have Adam’s normalization.
  • It shows that Muon isn’t a magic replacement for Adam. Instead, it’s a useful form of “spectral normalization” for certain cases.
  • It suggests that second-moment (Adam-like) methods remain strong, and that “how much to flatten” is a tuning knob, not a one-size-fits-all fix.

What could this change or influence?

  • Practical training choices: If you’re using momentum-only methods and struggle with stability, adding Muon-like spectral compression (especially partial compression) can help. But if you’re already using Adam-style updates, Muon’s full flattening may not be worth it.
  • Design of future optimizers: The paper’s unified view (the p family) provides a clear framework for building and testing new optimizers that adjust how aggressively they flatten gradients. It also offers fast computation tricks to make these ideas practical.
  • Honest expectations: Muon is best understood as a powerful stabilizer for certain inputs, not a guaranteed upgrade over Adam.

Simple takeaway

  • Muon flattens the “direction strengths” of gradient updates to stabilize training, which helps a lot if you’re not already controlling update sizes.
  • Adam already controls sizes, so Muon’s full flattening doesn’t usually beat it and can sometimes make things worse by boosting noisy directions.
  • Gentle flattening (p = 1/2 or p = 1/4) can be a sweet spot, especially without Adam-style normalization.
  • In short: think of Muon as a good spectral normalization tool, not a universal replacement for Adam.

A note on limitations and future directions

  • The experiments turned off weight decay and avoided extra stabilization tricks to keep comparisons clean. Real-world training often uses those, so future work should test how spectral methods interact with them.
  • The fast matrix method (coupled Newton–Schulz) still has overhead; improving efficiency would help scale these ideas further.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what the paper leaves unresolved, focusing on concrete, actionable gaps for future research.

  • External validity across scales and architectures:
    • Generalization beyond a single 124M-parameter GPT-2 on OpenWebText is untested (e.g., larger LLMs, ViTs/CNNs, speech models, RL, MT).
    • No evidence on downstream task performance (e.g., zero-shot, finetuning, transfer) beyond training/validation loss.
  • Weight decay and regularization:
    • Interaction between spectral updates and decoupled weight decay (AdamW-style) is not evaluated; best practices for matrix-wise vs elementwise decay remain unknown.
    • Interplay with widespread LLM stabilizers (e.g., QK-Norm, QK-Clip, gradient clipping) is deliberately excluded, leaving their combined effects uncharacterized.
    • Whether spectral compression acts as implicit regularization (and its effect on generalization gap) is not measured.
  • Runtime, memory, and scalability:
    • No wall-clock, FLOPs, memory footprint, or throughput measurements comparing spectral variants to Adam across model sizes.
    • Scalability and communication costs under data/tensor/pipeline parallelism are unquantified (e.g., extra matmul kernels inside the optimizer).
    • Practical feasibility on very large layers and tall/skinny vs wide/short matrices is not established.
  • Numerical stability and precision:
    • Robustness of the coupled Newton–Schulz iteration in mixed precision (bf16/fp16/fp8), under gradient spikes or extreme conditioning, is untested.
    • Convergence safeguards (e.g., adaptive scaling α, iteration caps, fallbacks) and their impact on numerical stability are not characterized.
  • Rank deficiency and singular-value pathologies:
    • Behavior of the polar factor in rank-deficient or near-singular updates (partial isometry corner cases) is not analyzed; effects on step quality remain unclear.
    • Handling of exact zeros or extremely small singular values (and their numerical consequences) is unspecified.
  • Design and scheduling of spectral compression:
    • Only fixed exponents p ∈ {1, 1/2, 1/4, 0} are tested; no adaptive p selection (per-layer, per-step) or time-varying schedules are explored.
    • Criteria to choose p based on online conditioning, signal-to-noise, or training phase are missing.
    • Periodic or stochastic application of spectral transforms (vs every step) to reduce overhead and preserve beneficial anisotropy is unexplored.
  • Algorithmic ablations and controls:
    • Number of Newton–Schulz iterations K, choice of scaling α, and their accuracy–speed trade-offs are not ablated.
    • Sensitivity to Adam hyperparameters (β1, β2, ε), learning-rate schedules (cosine/linear, warmup length), and batch size is not investigated.
    • Multi-seed variability, confidence intervals, and statistical significance are not reported; “representative runs” may mask variance.
  • Comparative baselines:
    • No head-to-head comparison with strong matrix/curvature-aware baselines (e.g., Shampoo, K-FAC), or widely used optimizers (Adafactor, LAMB, Lion, Sophia).
    • Absence of AdamW with tuned weight decay as a baseline weakens conclusions about superiority or parity in realistic settings.
  • Measurement of spectral effects:
    • The hypothesized “noise magnification” in small singular directions is not quantitatively validated (e.g., SNR per singular mode, effective rank dynamics).
    • No tracking of update singular spectra or condition numbers during training to correlate anisotropy control with stability/performance outcomes.
    • Alignment of update directions with raw gradients (cosine similarity, step efficiency) is not measured.
  • Scope of parameter coverage:
    • Only “matrix-shaped” parameters receive spectral updates; whether embeddings, LayerNorm scales/biases, and other vector-like or block-structured parameters could benefit from tailored spectral schemes is left open.
    • Treatment of non-square matrices when n > m (the paper assumes m ≥ n) and layer-shape-specific strategies (e.g., using O O^⊤ vs O^⊤ O) are not detailed.
  • Tensor and convolutional parameters:
    • Extensions of spectral transforms to higher-order tensors (e.g., convolutional kernels) or structured factorizations (e.g., Kronecker, blockwise) are not explored.
  • Theory and guarantees:
    • No convergence analysis, regret bounds, or stability guarantees for Ψp under stochastic nonconvex settings.
    • Lack of formal preconditioning interpretation for rectangular matrices (i.e., when and how Ψp equates to a principled preconditioner).
    • No theory linking p, curvature, and gradient noise to optimal risk or convergence rates.
  • Practical training questions:
    • Interaction with gradient clipping (ubiquitous in LLMs) is untested; whether spectral flattening reduces/increases clipping events is unknown.
    • Stage-dependent strategies (e.g., stronger compression early, milder later) and their impact on final perplexity or overfitting are unexplored.
    • Generalization to finetuning regimes (instruction tuning, RLHF, low-resource) and catastrophic forgetting behavior are untested.
  • Implementation and reproducibility:
    • Exact reproducibility details (seeds, full configs, logs) and completeness of the forthcoming code release are unspecified.
    • Guidance for practitioners on when the extra optimizer cost is justified by gains (decision rules, heuristics) is not provided.
  • Methodological extensions:
    • Alternative spectral nonlinearities (e.g., soft-thresholding, spectral clipping, learned shrinkage) beyond power-law σ^p are not examined.
    • Combining spectral compression with curvature-aware preconditioners (e.g., Ψp atop Adam/Shampoo) is untested.
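Several of the measurement gaps listed above (tracking update spectra, condition numbers, and effective rank) are cheap to instrument; a minimal sketch of two such diagnostics (the helper name `spectrum_diagnostics` and the entropy-based effective-rank definition are our illustrative choices, not the paper's):

```python
import numpy as np

def spectrum_diagnostics(O):
    """Condition number and entropy-based effective rank of an update matrix
    (illustrative diagnostics for tracking anisotropy during training)."""
    s = np.linalg.svd(O, compute_uv=False)
    cond = s.max() / max(s.min(), 1e-12)
    p = s / s.sum()
    eff_rank = float(np.exp(-(p * np.log(p + 1e-12)).sum()))
    return cond, eff_rank
```

Logging these per layer every few hundred steps would allow correlating anisotropy control with the stability and performance effects the paper reports.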

Practical Applications

Overview

Based on the paper’s unified spectral framework for matrix-shaped updates, its efficient coupled Newton–Schulz (NS) computation for fractional spectral transforms, and controlled empirical findings comparing Muon-style orthogonalization to Adam-like RMS normalization, the following applications emerge across industry, academia, policy, and daily practice.

Immediate Applications

  • 1) Drop-in spectral variants of Adam for matrix-shaped parameters (software/AI)
    • What: Use partial spectral compression on RMS-normalized updates (e.g., “AdamS” with p=1/2 or “AdamQ” with p=1/4) as a drop-in optimizer for matrix parameters while keeping standard Adam for vectors.
    • Tools/workflows:
    • PyTorch/TF optimizer wrappers that apply Ψp only to matrix-shaped weights.
    • Configurable p∈{1, 1/2, 1/4, 0}, tuned separately from vector learning rates.
    • Automated layer tagging (e.g., linear/attention projections).
    • Assumptions/dependencies:
    • Benefits are modest when second-moment normalization is already used; p=1/2 often a safe default.
    • Additional matrix multiplications add overhead; suitable for GPU/TPU.
    • Numerical scaling for NS iterations required for stability.
  • 2) Stabilize momentum-based training with Muon-like orthogonalization (software/robotics/embedded)
    • What: For pipelines still relying on first-moment momentum (mSGD) or constrained environments, apply Muon (p=0) or mild compression (p=1/4, 1/2) to improve robustness to ill-conditioning.
    • Tools/workflows:
    • mSGDZ (Muon) updates on matrix parameters; keep vectors on Adam or mSGD.
    • Fallback workflow: increase compression (lower p) only for unstable layers.
    • Assumptions/dependencies:
    • May underperform Adam in final metrics; best suited where RMS normalization is not feasible.
    • Matrix sizes should permit efficient batched matmuls.
  • 3) Efficient spectral primitives via coupled Newton–Schulz (software/hardware acceleration)
    • What: Package the coupled NS iteration to compute X^{±1/2}, polar factors, and quarter powers using only matrix multiplications (no SVD).
    • Tools/products:
    • Library kernels (CUDA/ROCm/XLA) for batched small-matrix NS updates.
    • Reusable module for optimizers, polar decomposition, and spectral normalization.
    • Assumptions/dependencies:
    • Convergence depends on input scaling; implement normalization and iteration caps.
    • Memory/compute overhead scales with number of matrix parameters and NS iterations.
  • 4) Layer-selective spectral compression for “problem” matrices (LLMs/CV/ASR)
    • What: Apply p<1 only to specific unstable modules (e.g., attention projections, deep MLP layers), leaving others on standard Adam.
    • Tools/workflows:
    • Per-layer p scheduling and per-group learning rates.
    • Heuristics: lower p if condition proxies spike or training diverges.
    • Assumptions/dependencies:
    • Requires instrumentation to identify unstable layers.
    • Gains depend on model- and task-specific spectra.
  • 5) Spectral anisotropy diagnostics for training health (industry/academia)
    • What: Monitor condition-number proxies κ(O_t) of update matrices to detect ill-conditioning and trigger mitigations (lower p, reduce LR).
    • Tools/workflows:
    • Logging hooks for κ estimates via O_t^⊤ O_t (power iterations or low-rank approximations).
    • Integration with experiment trackers (W&B/MLflow).
    • Assumptions/dependencies:
    • Estimating κ efficiently requires approximations; tradeoff between accuracy and overhead.
    • Most valuable for matrix-heavy architectures (Transformers).
  • 6) Reproducible optimizer benchmarking protocol (academia/standards/policy)
    • What: Adopt the paper’s controlled comparison setup—decoupled matrix/vector LRs, disabled weight decay, and removal of auxiliary tricks—to isolate optimizer effects.
    • Tools/workflows:
    • Shared harness with fixed β1/β2, batch size, warmup, and tuning grids.
    • Reporting templates including matrix/vector LR separation and spectral settings.
    • Assumptions/dependencies:
    • Protocol may differ from production recipes (e.g., weight decay off); use both “controlled” and “production” evaluations for completeness.
  • 7) Teaching modules for spectral optimization (education)
    • What: Use the unified Ψp framework and coupled NS iteration to teach matrix functions, polar decomposition, and optimizer design.
    • Tools/workflows:
    • Jupyter notebooks demonstrating p∈{1, 1/2, 1/4, 0}, κ evolution, and NS convergence.
    • Assumptions/dependencies:
    • Requires GPU/CPU with BLAS for smooth classroom demos.
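Items 1 and 4 above amount to routing only matrix-shaped parameters through Ψ_p while vectors take an ordinary step; a hypothetical NumPy sketch (the names `spectral_step`, `lr_matrix`, and `lr_vector` are ours, and real optimizer state such as momentum is omitted for brevity):

```python
import numpy as np

def spectral_step(params, grads, lr_matrix=0.02, lr_vector=0.001, p=0.5):
    """Apply the spectral transform only to matrix-shaped gradients,
    leaving vector-shaped parameters on a plain gradient step (a sketch)."""
    for name, g in grads.items():
        if g.ndim == 2:                                   # matrix-shaped: Psi_p update
            U, s, Vt = np.linalg.svd(g, full_matrices=False)
            params[name] = params[name] - lr_matrix * (U @ np.diag(s ** p) @ Vt)
        else:                                             # vector-shaped: ordinary update
            params[name] = params[name] - lr_vector * g
    return params
```

In a real framework the `ndim == 2` test would be replaced by explicit layer tagging (e.g., linear/attention projections), and the SVD by the coupled NS iteration.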

Long-Term Applications

  • 1) Adaptive spectral exponent scheduling (software/AI research)
    • What: Learn or schedule p online per layer based on κ(O_t) or other signals, balancing stabilization vs. preserving informative anisotropy.
    • Tools/products:
    • “Adaptive Spectral Adam” that tunes p∈[0,1] via rules or learned controllers.
    • Assumptions/dependencies:
    • Requires robust κ estimation and stability analysis; added compute.
    • Interaction with momentum/second-moment dynamics needs theoretical support.
  • 2) Production-ready spectral optimizers integrated with weight decay and popular LLM tricks (industry)
    • What: Combine spectral compression with decoupled weight decay (AdamW), QK-Norm/Clip, and large-scale training recipes to improve stability at scale.
    • Tools/products:
    • Next-gen optimizer suite exposing p, decay, and norm/clip knobs with safe defaults.
    • Assumptions/dependencies:
    • Paper’s findings exclude weight decay and auxiliary tricks; empirical re-validation required.
    • Large-scale ablations on multi-billion parameter models and multi-node training.
  • 3) High-performance, distributed NS kernels (systems/hardware)
    • What: Mixed-precision, fused, and distributed coupled-NS implementations (e.g., FP8/FP16 with error compensation) for batched small matrices across devices.
    • Tools/products:
    • Custom kernels in cuBLASLt/TVM/Triton; XLA SPMD partitioning strategies.
    • Assumptions/dependencies:
    • Numerical error control is critical; hardware-specific tuning needed.
    • Benefits must outweigh kernel engineering costs.
  • 4) Energy and cost reductions via broader stable LR ranges (operations/energy)
    • What: Use partial spectral compression to widen stable learning-rate regions, reducing hyperparameter sweeps and failed runs.
    • Tools/workflows:
    • AutoML systems that toggle p as a stabilization knob during search.
    • Assumptions/dependencies:
    • Gains likely task- and model-dependent; may be modest with RMS-normalized optimizers.
    • Must be validated with end-to-end cost metrics.
  • 5) Sector-specific robust training recipes (healthcare, finance, safety-critical AI)
    • What: For models where training failures are costly, use conservative spectral settings (e.g., p=1/2 on selected layers) to reduce instability without large performance trade-offs.
    • Tools/workflows:
    • Compliance-ready training templates that prioritize stability and auditability.
    • Assumptions/dependencies:
    • Domain generalization requires validation; modest compute overhead acceptable.
    • Regulatory environments may demand extensive documentation of optimizer choices.
  • 6) Applications beyond deep nets: spectral updates in control/RL and calibration (robotics)
    • What: Apply spectral compression/orthogonalization to matrix-valued updates in policy optimization, system identification, or calibration to mitigate ill-conditioning.
    • Tools/workflows:
    • Optimizer modules wrapping policy-gradient or Jacobian-based updates with Ψp.
    • Assumptions/dependencies:
    • Structure of update matrices must align with spectral assumptions; empirical benefits uncertain.
    • Real-time constraints may limit NS iteration counts.
  • 7) Extensions to tensor-valued parameters (research)
    • What: Generalize Ψp to tensors via multi-mode SVD, Kronecker, or low-rank factorizations for convolutional kernels and attention tensors.
    • Tools/products:
    • Approximate multi-mode spectral transforms with efficient factorizations.
    • Assumptions/dependencies:
    • Algorithmic complexity and memory can be high; theory and scalable approximations needed.
  • 8) Standardized reporting for optimizer studies (policy/standards)
    • What: Encourage venues and repositories to require reporting of matrix vs. vector LRs, weight decay coupling, and use of auxiliary techniques to improve comparability.
    • Tools/workflows:
    • Checklists and metadata schemas for optimizer configurations.
    • Assumptions/dependencies:
    • Community and venue buy-in; backward compatibility with existing benchmarks.

Notes on Feasibility and Generalization

  • The paper’s empirical conclusions are drawn from controlled GPT-2 (124M) experiments on nanoGPT with weight decay disabled and without auxiliary tricks (e.g., QK-Norm/Clip). Results may differ in production-scale settings.
  • Coupled NS iteration is practical for matrix-shaped parameters but introduces nontrivial matmul overhead; benefits depend on matrix size distributions and available accelerator throughput.
  • Code availability (BeyondMuon repository) is a dependency for replication and rapid prototyping.

Glossary

  • Anisotropic scaling: Direction-dependent magnitude differences in a matrix’s spectrum that can cause imbalanced updates; reducing it improves stability. "As a result, Muon removes anisotropic scaling in the spectrum of M_t, yielding directionally balanced but magnitude-agnostic updates."
  • Coupled Newton–Schulz iteration: A two-sequence iterative method that simultaneously computes a matrix square root and its inverse using only matrix multiplications. "Coupled Newton-Schulz iteration~\cite{higham_higham2008functions} provides an efficient and numerically stable procedure for simultaneously computing the matrix square root X^{1/2} and its inverse X^{-1/2}."
  • Fisher information matrix: A matrix capturing curvature (information) of the parameterized model’s likelihood, often used for second-order preconditioning. "K-FAC~\cite{kfac_martens2015optimizing} introduced a Kronecker-factored approximation of the Fisher information matrix, enabling efficient layer-wise preconditioning,"
  • Frobenius norm: The square root of the sum of squared matrix entries; commonly used for normalization and scaling in matrix algorithms. "In practice, a normalization step based on the Frobenius norm of X is applied at initialization to ensure numerical stability,"
  • K-FAC: Kronecker-Factored Approximate Curvature; an optimizer that approximates curvature via Kronecker factorizations of the Fisher information matrix. "K-FAC~\cite{kfac_martens2015optimizing} introduced a Kronecker-factored approximation of the Fisher information matrix, enabling efficient layer-wise preconditioning,"
  • Kronecker factorizations: Decompositions using Kronecker products along matrix dimensions to structure curvature approximations. "Shampoo~\cite{shampoo_gupta2018shampoo} extended this idea by applying Kronecker factorizations along multiple matrix dimensions, leading to improved stability."
  • Manifold-aware optimization: Optimization that respects the geometric constraints of parameter manifolds (e.g., orthogonal groups). "and geometric interpretations that view it as implicit manifold-aware optimization~\cite{bernstein_bernstein2025manifolds, sam_buchanan2025mmuonadmm}."
  • Momentum SGD: Stochastic gradient descent augmented with momentum to accumulate first-moment information across iterations. "Adam-style optimizers that incorporate second-moment statistics generally outperform momentum SGD-based optimizers that only leverage first-moment information."
  • Newton–Schulz iteration: An iterative scheme to compute matrix inverse roots using only matrix multiplications; often used to avoid explicit SVD. "However, we can approximate the inverse square root using Newton--Schulz iteration, which relies only on matrix multiplications."
  • Orthogonalization: Transforming a matrix update to have orthonormal singular directions (unit magnitude across directions). "Computing the polar factor is typically referred to as matrix orthogonalization."
  • Polar factor: The orthogonal component in the polar decomposition of a matrix, given by $UV^{\top}$ for the SVD $U\Sigma V^{\top}$. "Muon defines the update direction as the polar factor"
  • Preconditioning: Re-scaling or transforming gradients/updates to account for curvature, improving optimization stability and convergence. "From a preconditioning perspective, Adam applies a diagonal matrix $V_t^{-1/2}$ to the momentum $M_t$, normalizing each coordinate by an estimate of its second-moment"
  • RMS-normalized update: An update normalized by the root-mean-square of past gradients (second-moment), as in Adam. "we refer to $M_t \oslash \sqrt{V_t}$ as the RMS-normalized update, since $V_t$ tracks an exponential moving average of squared gradients."
  • Singular value decomposition (SVD): Factorization of a matrix into singular vectors and singular values; central to spectral methods. "Computing the polar factor of $M_t$, $U V^{\top} = M_t(M_t^{\top}M_t)^{-1/2}$, via an explicit SVD is prohibitively expensive for large models."

  • Singular vectors: The orthonormal bases (left/right) from SVD that define principal directions of a matrix’s action. "which discards the singular values and retains only the left and right singular vectors."
  • Spectral anisotropy: Uneven distribution of singular values across directions, leading to ill-conditioning in updates. "Therefore, decreasing $p$ reduces spectral anisotropy:"
  • Spectral compression: Dampening large singular values relative to small ones to reduce anisotropy and stabilize updates. "while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update ($p = 0$) does not consistently outperform Adam."
  • Spectral condition number: The ratio of the largest to smallest singular values; measures ill-conditioning of a matrix. "To quantify anisotropy, we use the spectral condition number"
  • Spectral flattening: Mapping all nonzero singular values to a constant (often 1), equalizing directional magnitudes. "This viewpoint clarifies how the exponent $p$ continuously interpolates between the original spectrum ($p=1$) and complete spectral flattening ($p=0$)."
  • Spectral normalization: Rescaling updates in the spectral domain (via singular values) to control magnitudes across directions. "These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method."
  • Spectral transformation: A mapping of a matrix’s SVD components via a function of singular values (e.g., $\Sigma^p$) to shape update behavior. "we view Muon as the $p = 0$ endpoint of a family of spectral transformations of the form $\boldsymbol{U} \boldsymbol{\Sigma}^{p} \boldsymbol{V}^{\top}$"
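Several of the glossary entries above fit together in a few lines of code. The following toy NumPy sketch of the $U\Sigma^{p}V^{\top}$ family (function names are our own, and the paper's released code uses coupled Newton iterations rather than the explicit SVD shown here) illustrates how $p = 1$ leaves the update unchanged, while $p = 0$ recovers the polar factor and drives the spectral condition number to 1:

```python
import numpy as np

def spectral_transform(M, p):
    """Map M = U diag(s) V^T to U diag(s**p) V^T.
    p = 1 returns M unchanged; p = 0 flattens all (nonzero)
    singular values to 1, i.e. the polar factor used by Muon."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(s ** p) @ Vt

def spectral_condition_number(M):
    """Ratio of largest to smallest singular value (anisotropy measure)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s.max() / s.min()
```

For a full-rank update matrix, decreasing $p$ from 1 toward 0 compresses the spectrum (e.g., $p = 1/2$ replaces the condition number by its square root), and at $p = 0$ the result has orthonormal columns with condition number exactly 1, matching the "spectral flattening" view of Muon.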
