Theory of auxiliary load-balancing loss under bounded logits from hypersphere optimization

Develop a theoretical analysis explaining why, under hypersphere optimization with Frobenius-sphere constraints that bound router logits, large weights on the Switch-Transformer load-balancing auxiliary loss do not harm language modeling quality and can improve expert load balance.

Background

Experiments show that with MuonH and Frobenius-sphere constraints, larger auxiliary loss weights for load balancing in Mixture-of-Experts layers yield both better validation loss and improved load balance—contrary to prior findings that large auxiliary weights can hurt quality.
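For concreteness, the auxiliary loss in question is the Switch-Transformer load-balancing term, \(L_{aux} = N \sum_i f_i P_i\), where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is the mean router probability for expert \(i\); it is minimized at 1 under perfectly uniform routing. A minimal NumPy sketch (illustrative only, not the authors' implementation; all names are hypothetical):

```python
import numpy as np

def switch_aux_loss(logits, expert_idx, num_experts):
    """Switch-Transformer balance loss: N * sum_i f_i * P_i.

    f_i: fraction of tokens routed (top-1) to expert i.
    P_i: mean softmax router probability assigned to expert i.
    """
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)  # softmax over experts
    f = np.bincount(expert_idx, minlength=num_experts) / len(expert_idx)
    P = probs.mean(axis=0)
    return num_experts * float(f @ P)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1024, 8))   # [tokens, experts]
idx = logits.argmax(axis=-1)          # top-1 routing
loss = switch_aux_loss(logits, idx, 8)  # near 1.0 for roughly balanced routing
```

When routing collapses onto one expert, both \(f\) and \(P\) concentrate there and the loss approaches \(N\), which is what a large auxiliary weight penalizes.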

The authors hypothesize that bounded logits under hypersphere optimization prevent interference between the auxiliary loss and the language modeling objective, and call for a theoretical explanation of this effect.
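The boundedness itself follows from Cauchy-Schwarz: with the router weight matrix \(W\) held on a Frobenius sphere \(\lVert W\rVert_F = r\), every logit satisfies \(|z_i| = |w_i \cdot x| \le \lVert w_i\rVert_2 \lVert x\rVert_2 \le r \lVert x\rVert_2\), so norm-controlled hidden states give uniformly bounded logits regardless of how hard the auxiliary loss pushes the router. A small numerical check of this inequality (a sketch under assumed shapes; the unit-norm input stands in for a normalized hidden state):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, r = 64, 8, 1.0

# Project the router weights onto the Frobenius sphere of radius r.
W = rng.normal(size=(n_experts, d))
W *= r / np.linalg.norm(W)            # ||W||_F == r

# Unit-norm input as a stand-in for a normalized hidden state.
x = rng.normal(size=d)
x /= np.linalg.norm(x)

logits = W @ x
bound = r * np.linalg.norm(x)         # Cauchy-Schwarz bound on every |logit|
assert np.all(np.abs(logits) <= bound + 1e-12)
```

Because the bound depends only on \(r\) and \(\lVert x\rVert\), scaling up the auxiliary-loss weight can reshape the routing distribution without driving logits to extremes, which is the mechanism the sought-after theory would formalize.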

References

"Under hypersphere optimization, the bounded logits (\Cref{prop:bounded}) likely prevent the auxiliary loss from interfering with the language modeling objective, and we leave the theoretical study to future work."

Rethinking Language Model Scaling under Transferable Hypersphere Optimization (2603.28743 - Ren et al., 30 Mar 2026) in Section 4.4 MoE Scaling – Auxiliary Balance Loss