Minimal hyperparameter set for reliable transfer of BST scaling predictions
Determine the minimal set of optimizer and model hyperparameters that must be tuned on a small model so that the batch–sequence–token (BST) scaling rule, together with the associated theoretical predictions for choosing batch size, sequence length, and Frank–Wolfe stepsize, remains practically actionable when transferred to larger models or longer token budgets.
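The transfer question can be made concrete with a small sketch: tune a hyperparameter (here, batch size) on a few small token budgets, fit a scaling rule to the results, and extrapolate to a larger budget. Everything below is a hypothetical illustration under assumed conventions, not the paper's method: the power-law form `B*(T) = a * T^b`, the synthetic sweep numbers, and the function names `fit_bst_rule` / `predict_batch_size` are all made up for this sketch.

```python
import math

def fit_bst_rule(token_budgets, best_batch_sizes):
    """Fit a hypothetical power law B*(T) = a * T^b by least squares
    in log-log space, using sweep results from small-model tuning runs.
    Returns (a, b)."""
    xs = [math.log(t) for t in token_budgets]
    ys = [math.log(b) for b in best_batch_sizes]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), slope

def predict_batch_size(a, b, token_budget):
    """Extrapolate the fitted rule to a larger token budget."""
    return a * token_budget ** b

# Synthetic sweep: best batch size found at each small token budget
# (made-up numbers that happen to follow B* proportional to sqrt(T)).
T_small = [1e6, 4e6, 16e6]
B_small = [32, 64, 128]

a, b = fit_bst_rule(T_small, B_small)
B_large = predict_batch_size(a, b, 1e9)  # transfer to a 1B-token run
```

The open question in the statement above is which quantities must enter such a fit (and be re-tuned on the small model) for the extrapolated prediction to remain trustworthy at scale.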
References
Determining the minimal set of hyperparameters that must be tuned on a small model to ensure that our theoretical predictions remain practically actionable remains an important open question, which we leave for future work.
— On the Role of Batch Size in Stochastic Conditional Gradient Methods
(2603.21191 - Islamov et al., 22 Mar 2026) in Limitations and Future Work