Critical batch size dependence on training token budget
Determine how the critical batch size depends on the total number of training tokens for Transformer-Next models trained with MuonH under the HyperP parameterization. Concretely, for each of several token budgets, sweep the batch size and measure the point at which further increases begin to degrade the minimum achievable validation loss; the dependence of that point on the token budget is the quantity of interest.
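The detection step in the procedure above can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes each (token budget, batch size) pair has already been trained to its minimum achievable validation loss, and defines the critical batch size as the largest batch size whose loss stays within a relative tolerance of the best loss at that budget. The function name, the tolerance value, and all loss numbers are hypothetical placeholders.

```python
# Hedged sketch: estimate the critical batch size from a completed
# batch-size sweep at a fixed token budget. Assumes `loss_by_batch`
# maps batch size -> minimum achievable validation loss, measured
# elsewhere; the 1% relative tolerance is an arbitrary choice.

def critical_batch_size(loss_by_batch: dict[int, float],
                        rel_tol: float = 0.01) -> int:
    """Largest batch size whose loss is within rel_tol of the best loss."""
    best = min(loss_by_batch.values())
    within = [b for b, loss in loss_by_batch.items()
              if loss <= best * (1.0 + rel_tol)]
    return max(within)

# Synthetic sweeps at two hypothetical token budgets; in this toy data the
# critical batch size grows with the budget, which is one possible outcome
# the proposed experiments would test.
sweeps = {
    10_000_000_000:  {256: 2.31, 512: 2.30, 1024: 2.30, 2048: 2.33, 4096: 2.40},
    100_000_000_000: {256: 2.01, 512: 2.00, 1024: 2.00, 2048: 2.00, 4096: 2.06},
}

for tokens, losses in sweeps.items():
    print(f"{tokens:>15,} tokens -> critical batch size {critical_batch_size(losses)}")
```

Repeating this across token budgets yields the critical-batch-size-versus-tokens curve the study calls for; the open experimental question is the shape of that curve, which the toy data above does not answer.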
References
We leave the study of the relationship between critical batch size and training tokens to future work, as it requires a straightforward but costly scaling of the same suite of experiments explored in this section across multiple token budgets.
— Rethinking Language Model Scaling under Transferable Hypersphere Optimization
(2603.28743 - Ren et al., 30 Mar 2026), Section 4.3, Critical Batch Size