Minimal hyperparameter set for reliable transfer of BST scaling predictions

Determine the minimal set of optimizer and model hyperparameters that must be tuned on a small model so that the batch–sequence–token (BST) scaling rule and the associated theoretical predictions for selecting batch size, sequence length, and Frank–Wolfe stepsize remain practically actionable when transferred to larger models or longer token budgets.

Background

The paper develops a token-budget–aware theory for scaling batch size, sequence length, and Frank–Wolfe stepsize in stochastic conditional gradient methods, and proposes the BST scaling rule. In its experiments, several hyperparameters (e.g., radii and variance initialization) are adopted from prior work without extensive tuning.
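To make the quantities concrete, the sketch below shows a generic stochastic conditional gradient (Frank–Wolfe) step, in which the batch size controls the variance of the gradient estimate and the stepsize interpolates toward the output of a linear minimization oracle over a norm ball of a given radius. It is a minimal illustration under placeholder choices (an L2 ball, a least-squares loss, and the classical 2/(t+2) schedule), not the paper's BST rule.

```python
import numpy as np

def lmo_l2_ball(grad, radius):
    """Linear minimization oracle over an L2 ball:
    argmin_{||s|| <= radius} <grad, s> = -radius * grad / ||grad||."""
    norm = np.linalg.norm(grad)
    return -radius * grad / norm if norm > 0 else np.zeros_like(grad)

def stochastic_fw_step(w, data, batch_size, radius, stepsize, rng):
    """One stochastic Frank-Wolfe step on a least-squares loss.

    The batch size controls the variance of the mini-batch gradient;
    the stepsize sets the convex combination between the current
    iterate and the LMO vertex.
    """
    X, y = data
    idx = rng.choice(len(y), size=batch_size, replace=False)
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / batch_size   # mini-batch gradient
    s = lmo_l2_ball(grad, radius)             # vertex of the feasible set
    return (1 - stepsize) * w + stepsize * s  # convex combination

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 16))
w_true = rng.standard_normal(16)
y = X @ w_true + 0.1 * rng.standard_normal(1024)

w = np.zeros(16)
for t in range(200):
    w = stochastic_fw_step(w, (X, y), batch_size=64,
                           radius=np.linalg.norm(w_true) + 1.0,
                           stepsize=2.0 / (t + 2), rng=rng)
```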

The authors note that substantially suboptimal choices of such hyperparameters can degrade the practical accuracy of their BST-based predictions. They therefore highlight the need to identify which hyperparameters must be tuned on small proxy models so that the theory provides reliable guidance at larger scales.
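One plausible way to operationalize the question is a sensitivity sweep on the small proxy: perturb each candidate hyperparameter around its default, measure the resulting change in the metric that the BST predictions target, and flag hyperparameters whose perturbation moves the outcome beyond a tolerance as "must tune". The sketch below is a hypothetical harness, not the authors' protocol; train_proxy is a toy surrogate standing in for an actual proxy-model training run, and all names and values are illustrative.

```python
import numpy as np

def train_proxy(hparams):
    """Placeholder for training the small proxy model and returning the
    metric the BST predictions target (e.g., final loss). Here a toy
    surrogate: loss grows quadratically with the log-distance of each
    hyperparameter from a hypothetical optimum."""
    optimum = {"radius": 10.0, "variance_init": 0.02}
    return sum(np.log(hparams[k] / optimum[k]) ** 2 for k in optimum)

def must_tune(defaults, factors=(0.25, 4.0), tol=0.5):
    """Flag hyperparameters whose multiplicative perturbation changes
    the proxy metric by more than `tol` relative to the default run."""
    base = train_proxy(defaults)
    sensitive = []
    for name in defaults:
        worst = max(
            abs(train_proxy({**defaults, name: defaults[name] * f}) - base)
            for f in factors
        )
        if worst > tol:
            sensitive.append((name, worst))
    return sensitive

defaults = {"radius": 8.0, "variance_init": 0.05}
print(must_tune(defaults))  # hyperparameters needing tuning on the proxy
```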

References

"Determining the minimal set of hyperparameters that must be tuned on a small model to ensure that our theoretical predictions remain practically actionable remains an important open question, which we leave for future work."

On the Role of Batch Size in Stochastic Conditional Gradient Methods  (2603.21191 - Islamov et al., 22 Mar 2026) in Limitations and Future Work