Explain why gradient orthogonalization works in Muon

Explain the theoretical mechanism by which orthogonalizing gradient updates via singular value decomposition—setting all non-zero singular values of the update to one, as implemented in the Muon optimizer—produces effective optimization behavior in deep learning models.
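The operation in question can be sketched directly: take the SVD of the gradient matrix and replace every non-zero singular value with one, which leaves only the "direction" of the update and caps its spectral norm at 1. This is an illustrative sketch using an exact SVD, not Muon's actual implementation (which uses a cheaper iterative approximation):

```python
import numpy as np

def orthogonalize(G):
    """Set the non-zero singular values of G to one.

    For full-rank G, U @ Vt equals U @ diag(1, ..., 1) @ Vt,
    i.e. the orthogonal (semi-orthogonal) factor of G.
    Illustrative only; Muon approximates this without an SVD.
    """
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

G = np.random.randn(4, 3)   # a generic full-rank gradient matrix
O = orthogonalize(G)
# All singular values of O are now (numerically) 1,
# so its spectral norm is exactly 1.
print(np.linalg.svd(O, compute_uv=False))
```

The result is the nearest semi-orthogonal matrix to `G` in Frobenius norm, which is one common way to motivate the update.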

Background

Muon performs approximate orthogonalization of gradient updates to constrain the operator norm of each step, a distinctive departure from SGD and Adam. While this design correlates with markedly faster training across several tasks, the mechanism by which orthogonalization produces these benefits has not been formally explained.
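The "approximate orthogonalization" can be realized without an explicit SVD via a Newton-Schulz style matrix iteration, which only needs matrix multiplies. The sketch below uses the classical cubic iteration for clarity; Muon itself uses a tuned higher-order polynomial, so treat the coefficients and step count here as illustrative assumptions:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximately orthogonalize G using the cubic Newton-Schulz
    iteration X <- 1.5 X - 0.5 X X^T X.

    Each step maps every singular value s to 1.5 s - 0.5 s^3, which
    converges to 1 for s in (0, sqrt(3)). Normalizing by the Frobenius
    norm first puts all singular values in (0, 1], guaranteeing
    convergence. Sketch only; not Muon's exact polynomial or step count.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

np.random.seed(0)
G = np.random.randn(5, 3)
O = newton_schulz_orthogonalize(G)
# Singular values of O are driven toward 1, so the update's
# operator (spectral) norm is constrained to about 1.
print(np.linalg.svd(O, compute_uv=False))
```

Because each iteration is just matrix multiplication, this runs efficiently on accelerators, which is the practical reason the exact SVD is avoided.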

The authors note that even introductory materials attribute Muon’s success to unexplained factors, highlighting a need for a formal account of how and why gradient orthogonalization affects optimization dynamics and generalization in neural networks.

References

The reason why orthogonalization works remains unclear.

To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters  (2603.00742 - Dragutinović et al., 28 Feb 2026) in Section 2 (Muon Optimizer)