Critical batch size dependence on training token budget

Determine how the critical batch size of Transformer-Next models trained with MuonH under the HyperP parameterization depends on the total training token budget, by measuring, at each of several token budgets, the batch size at which further increases begin to degrade the minimum achievable validation loss.

Background

The authors study batch-size scaling at a single fixed token budget and observe that the optimal losses are stable across all tested batch sizes, indicating that every run operates below the critical batch size. They do not, however, explore how this threshold varies with the training token budget.

Understanding the relationship between critical batch size and token budget matters for designing scalable, compute-efficient training runs, but establishing it requires repeating the full experimental suite at multiple token budgets.
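The measurement itself reduces to a simple sweep analysis: at each token budget, collect the best validation loss achieved at each batch size and find the largest batch size whose loss is still within a tolerance of the overall minimum. The sketch below illustrates this detection step; the helper name, the tolerance, and all numbers are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: estimate the critical batch size from a sweep of
# (batch_size, best validation loss) pairs at one token budget.
# `tol` and all loss values below are made-up illustrative assumptions.

def critical_batch_size(sweep, tol=0.002):
    """Return the largest batch size whose best loss stays within `tol`
    of the overall minimum; larger batch sizes are considered degraded."""
    sweep = sorted(sweep)                      # sort by batch size
    best = min(loss for _, loss in sweep)      # minimum achievable loss
    critical = sweep[0][0]
    for bs, loss in sweep:
        if loss <= best + tol:
            critical = bs                      # still below the threshold
        else:
            break                              # first degraded batch size
    return critical

# Illustrative sweeps for two token budgets (made-up numbers): the
# critical batch size would be expected to shift with the budget.
sweeps = {
    "1B tokens": [(64, 3.011), (128, 3.010), (256, 3.011), (512, 3.030)],
    "4B tokens": [(64, 2.708), (128, 2.707), (256, 2.707),
                  (512, 2.708), (1024, 2.735)],
}

for budget, sweep in sweeps.items():
    print(budget, "->", critical_batch_size(sweep))
```

Fitting the resulting (token budget, critical batch size) pairs, e.g. with a power law, would then characterize the dependence the task asks about.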

References

We leave the study of the relationship between critical batch size and training tokens to future work, as it requires a straightforward but costly scaling of the same suite of experiments explored in this section across multiple token budgets.

Rethinking Language Model Scaling under Transferable Hypersphere Optimization (2603.28743 - Ren et al., 30 Mar 2026), Section 4.3, Critical Batch Size