Explain why gradient orthogonalization works in Muon
Explain the theoretical mechanism by which orthogonalizing gradient updates via singular value decomposition—setting all non-zero singular values of the update to one, as implemented in the Muon optimizer—produces effective optimization behavior in deep learning models.
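To make the operation in question concrete, the following is a minimal sketch of the orthogonalization step as described above: take the SVD of an update matrix and replace every non-zero singular value with one. It uses an exact SVD in NumPy for clarity. The function name `orthogonalize` and the tolerance `tol` are assumptions made for this illustration; they are not taken from the cited paper or from any Muon implementation.

```python
import numpy as np


def orthogonalize(update: np.ndarray, tol: float = 1e-7) -> np.ndarray:
    """Return `update` with all non-zero singular values set to one.

    Hypothetical helper for illustration only. `tol` is an assumed relative
    threshold below which a singular value is treated as zero.
    """
    # Thin SVD: update = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(update, full_matrices=False)
    # Keep only directions whose singular value is numerically non-zero.
    keep = s > tol * s.max(initial=0.0)
    # Dropping diag(s) is the same as setting the kept singular values to 1.
    return u[:, keep] @ vt[keep, :]


rng = np.random.default_rng(0)
g = rng.standard_normal((4, 3))   # stand-in for a gradient/update matrix
o = orthogonalize(g)

# Every singular value of the orthogonalized update should now equal one.
singular_values = np.linalg.svd(o, compute_uv=False)
print(singular_values)
```

The result keeps the singular vectors of the original update and discards its singular-value magnitudes, so every retained direction receives the same step size. Why this equalization leads to effective optimization is the question posed here.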
References
The reason why orthogonalization works stays unclear.
— To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters
(2603.00742 - Dragutinović et al., 28 Feb 2026) in Section 2 (Muon Optimizer)