Theory of auxiliary load-balancing loss under bounded logits from hypersphere optimization
Develop a theoretical analysis of why, under hypersphere optimization, where Frobenius-sphere constraints bound the router logits, a large weight on the Switch Transformer load-balancing auxiliary loss does not harm language modeling quality and can improve expert load balance.
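To make the two ingredients of the problem concrete, the following is a minimal sketch, not the paper's implementation. It assumes a linear router with weight matrix W held on a Frobenius sphere of a hypothetical radius, and unit-norm token representations (an illustrative assumption). Under those assumptions, Cauchy-Schwarz gives |logit| <= ||W||_F * ||x||, so every softmax routing probability is bounded below by 1 / (1 + (N - 1) * exp(2B)), where B is the logit bound and N the number of experts. The auxiliary loss follows the standard Switch Transformer form alpha * N * sum_i f_i * P_i. All function names, sizes, and the radius are made up for illustration.

```python
import math
import random


def project_to_frobenius_sphere(W, radius):
    """Rescale W (a list of rows) so that its Frobenius norm equals `radius`."""
    norm = math.sqrt(sum(w * w for row in W for w in row))
    return [[radius * w / norm for w in row] for row in W]


def unit(v):
    """Normalize a vector to unit Euclidean norm (illustrative assumption)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def switch_balance_loss(logits_per_token, alpha=1.0):
    """Switch Transformer auxiliary loss: alpha * N * sum_i f_i * P_i.

    f_i is the fraction of tokens whose top-1 expert is i,
    P_i is the mean router probability assigned to expert i.
    """
    n_tokens = len(logits_per_token)
    n_experts = len(logits_per_token[0])
    f = [0.0] * n_experts
    P = [0.0] * n_experts
    for z in logits_per_token:
        p = softmax(z)
        top1 = max(range(n_experts), key=lambda i: p[i])
        f[top1] += 1.0 / n_tokens
        for i in range(n_experts):
            P[i] += p[i] / n_tokens
    return alpha * n_experts * sum(fi * Pi for fi, Pi in zip(f, P))


random.seed(0)
n_experts, d_model, n_tokens = 4, 8, 64  # hypothetical sizes
radius = 2.0                             # hypothetical Frobenius-sphere radius

W = project_to_frobenius_sphere(
    [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)],
    radius,
)
tokens = [unit([random.gauss(0, 1) for _ in range(d_model)])
          for _ in range(n_tokens)]
logits = [[sum(w * x for w, x in zip(row, t)) for row in W] for t in tokens]

# Cauchy-Schwarz: |w_i . x| <= ||w_i|| * ||x|| <= ||W||_F * ||x|| = radius * 1
logit_bound = radius * 1.0
max_abs_logit = max(abs(v) for z in logits for v in z)

# Bounded logits keep every routing probability away from zero.
min_prob_bound = 1.0 / (1.0 + (n_experts - 1) * math.exp(2.0 * logit_bound))
min_prob = min(min(softmax(z)) for z in logits)

aux = switch_balance_loss(logits)
print(max_abs_logit, logit_bound, min_prob, min_prob_bound, aux)
```

The sketch only illustrates the setting: with bounded logits, no expert's routing probability can collapse to zero. Whether this is what keeps a heavily weighted auxiliary loss from interfering with the language modeling objective is exactly the question left open.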
References
Under hypersphere optimization, the bounded logits (\Cref{prop:bounded}) likely prevent the auxiliary loss from interfering with the language modeling objective, and we leave the theoretical study to future work.
— Rethinking Language Model Scaling under Transferable Hypersphere Optimization
(2603.28743 - Ren et al., 30 Mar 2026) in Section 4.4 MoE Scaling – Auxiliary Balance Loss