
How to Set the Batch Size for Large-Scale Pre-training?

Published 8 Jan 2026 in cs.AI | (2601.05034v2)

Abstract: The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms fail to align with new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised E(S) relationship tailored for WSD scheduler, characterizing the trade-off between training data consumption E and steps S during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building upon these properties, we propose a dynamic Batch Size Scheduler. Extensive experiments demonstrate that our revised formula precisely captures the dynamics of large-scale pre-training, and the resulting scheduling strategy significantly enhances both training efficiency and final model quality.

Summary

  • The paper revises the Critical Batch Size theory under WSD LR schedules, introducing a new E(S) formulation that captures dynamic training phases.
  • It defines novel metrics Bmin and Bopt to determine optimal batch sizes, empirically demonstrating improved data efficiency and convergence.
  • Dynamic batch size scheduling is shown to outperform fixed-batch methods, leading to smoother loss convergence and enhanced downstream performance.

Revisiting Batch Size Scheduling for Large-Scale Model Pre-Training under WSD LR Schedules

Introduction

The effective configuration of batch size is a pivotal consideration in the optimization of large-scale pre-training for LLMs. Traditionally, the Critical Batch Size theory provided a principled framework for balancing data consumption and optimization steps under cosine learning rate schedules. However, the paradigm shift toward Warmup-Stable-Decay (WSD) learning rate schedulers has rendered the foundational assumptions and resulting scaling relations inapplicable. The paper "How to Set the Batch Size for Large-Scale Pre-training?" (2601.05034) addresses this theoretical and practical disconnect by developing a revised E(S) relationship tailored to WSD-based training regimes, identifying new batch size metrics, and devising a dynamic batch size scheduler that empirically yields improved efficiency and downstream performance.

Breakdown of Critical Batch Size Theory under WSD LR Schedules

The Critical Batch Size framework posits a monotonic trade-off between token consumption E and optimization steps S for achieving a fixed target loss, a relationship well captured by the E(S) formula

\left(\frac{E}{E_{min}} - 1\right)\left(\frac{S}{S_{min}} - 1\right) = 1.

This theory assumes a cosine or static learning rate schedule; however, empirical evidence shows that under WSD schedules, loss curves for different batch sizes intersect as training progresses. In particular, at lower target losses, larger batch sizes may consume less data than smaller ones, directly contradicting the monotonic ordering predicted by the classic E(S) formula. This regime change is visually evident in the loss curves of Figure 1.

Figure 1: Loss curves for models trained with different batch sizes under WSD schedule, illustrating breakdown of Critical Batch Size theory as the curves invert partial ordering post-intersection.
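For a concrete sense of what the classic relation implies, here is a minimal sketch that evaluates the hyperbolic trade-off for hypothetical values of E_min and S_min (not taken from the paper). Under this formula, E falls monotonically toward E_min as S grows, and the implied average batch size E/S shrinks, which is exactly the ordering WSD training is observed to break.

```python
# Illustrative only: evaluate the classic Critical Batch Size trade-off
#   (E/E_min - 1) * (S/S_min - 1) = 1   =>   E(S) = E_min * (1 + 1 / (S/S_min - 1))
# E_min and S_min below are hypothetical constants, not values fitted in the paper.

E_min = 1.0e10   # tokens needed in the limit of unlimited steps (hypothetical)
S_min = 5.0e4    # steps needed in the limit of unlimited data (hypothetical)

def classic_E(S: float) -> float:
    """Tokens required to hit the target loss when training for S steps (S > S_min)."""
    return E_min * (1.0 + 1.0 / (S / S_min - 1.0))

for S in [6e4, 1e5, 2e5, 1e6]:
    E = classic_E(S)
    print(f"S={S:>9,.0f}  E={E:.3e} tokens  avg batch ≈ {E / S:>11,.0f} tokens/step")
```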

Novel E(S) Formulation for WSD Schedules

To explain the observed phenomena, the authors construct a piecewise E(S) relationship reflecting three distinct dynamic phases:

  • Inverse Linear Stage: E varies inversely with S - S_min.
  • Transition Stage: E follows a quadratic function of S.
  • Linear Stage: E increases linearly with S.

These phases collectively capture the trade-off between computation steps and data usage across batch sizes in WSD-scheduled training. The revised E(S) is fit to measurements subject to continuity and differentiability constraints and is demonstrated to closely match empirical results across a suite of model sizes and batch size settings (Figure 2).

Figure 2: Fitting results of E(S) for InternLM2-1B in the loss interval [2.93, 3.25], validating the accuracy of the new formulation.
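The authors' fitting code is not reproduced in this summary; the sketch below shows one plausible way such a three-phase curve can be parameterized and fit. A quadratic middle piece anchors the inverse and linear pieces through its value and slope at the two breakpoints, so continuity and differentiability hold by construction, and SciPy's robust least-squares down-weights noisy measurements. Every parameter name and number here is an illustrative assumption, not the paper's parameterization.

```python
# Illustrative sketch (not the authors' code): fit a three-phase E(S) curve
# (inverse -> quadratic -> linear) to measured (steps, tokens) pairs, keeping the
# curve continuous and differentiable at both breakpoints by construction.
import numpy as np
from scipy.optimize import least_squares

def piecewise_E(S, S_min, S1, S2, a, Sv, Ev):
    """Quadratic middle piece q(S) = a*(S - Sv)^2 + Ev on [S1, S2]; the left
    (inverse) and right (linear) pieces are matched to q's value and slope."""
    S = np.asarray(S, dtype=float)
    q  = lambda s: a * (s - Sv) ** 2 + Ev
    dq = lambda s: 2.0 * a * (s - Sv)
    k  = -dq(S1) * (S1 - S_min) ** 2           # left piece: E = k/(S - S_min) + c0
    c0 = q(S1) - k / (S1 - S_min)
    left  = k / np.clip(S - S_min, 1e-9, None) + c0
    right = q(S2) + dq(S2) * (S - S2)          # right piece: tangent line at S2
    return np.where(S < S1, left, np.where(S <= S2, q(S), right))

def fit_E_of_S(S_data, E_data, p0):
    """Robust fit; the soft_l1 loss keeps outlier measurements from dominating."""
    resid = lambda p: piecewise_E(S_data, *p) - E_data
    return least_squares(resid, p0, loss="soft_l1",
                         f_scale=float(np.std(E_data)), x_scale="jac")

# Toy usage with synthetic measurements (all numbers hypothetical).
true_p = (2e3, 8e3, 3e4, 2.0e2, 2.5e4, 4.0e9)   # S_min, S1, S2, a, Sv, Ev
S_data = np.linspace(5e3, 6e4, 40)
rng = np.random.default_rng(0)
E_data = piecewise_E(S_data, *true_p) * (1 + 0.02 * rng.standard_normal(S_data.size))
fit = fit_E_of_S(S_data, E_data, p0=(1e3, 1e4, 2.5e4, 1e2, 2e4, 3e9))
print("fitted parameters:", fit.x)
```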

Emergent Metrics: B_min and B_opt

Analysis of the new E(S) relationship yields two core batch size metrics that supplant the old Critical Batch Size:

  • B_min: The minimum batch size required to reach a particular target loss.
  • B_opt: The batch size that optimizes data efficiency, minimizing token consumption to achieve the target loss.

Figure 3 shows that both B_min and B_opt increase monotonically as training loss decreases, providing the rationale for progressive batch size expansion during pre-training.

Figure 3: The scaling relationship shows that B_min and B_opt rise with decreasing target loss across model sizes.
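One way to read both quantities off a fitted curve numerically is sketched below. It assumes a geometric interpretation: a run at a fixed batch of B tokens per step traces the ray E = B·S, so B_min is the smallest slope whose ray still touches the curve (the minimum of E(S)/S), while B_opt is the tokens-per-step ratio at the step count that minimizes total data. The authors' exact definitions may differ in detail, and the stand-in curve is purely hypothetical.

```python
# Illustrative sketch (assumed geometric reading, not the paper's exact formulas):
#   B_min = min_S E(S)/S    smallest batch whose ray E = B*S still reaches the curve
#   B_opt = E(S*)/S*        with S* = argmin_S E(S), i.e. least total tokens
from scipy.optimize import minimize_scalar

def batch_metrics(E_of_S, S_lo, S_hi):
    res_E = minimize_scalar(E_of_S, bounds=(S_lo, S_hi), method="bounded")
    B_opt = res_E.fun / res_E.x
    res_r = minimize_scalar(lambda S: E_of_S(S) / S, bounds=(S_lo, S_hi), method="bounded")
    B_min = res_r.fun
    return B_min, B_opt

# Toy stand-in curve: a quadratic valley that turns into its tangent line (linear stage).
def toy_E(S, E0=4e9, S0=2.5e4, a=2e2, S2=3.5e4):
    q, dq = (lambda s: E0 + a * (s - S0) ** 2), (lambda s: 2 * a * (s - S0))
    return q(S) if S <= S2 else q(S2) + dq(S2) * (S - S2)

B_min, B_opt = batch_metrics(toy_E, S_lo=5e3, S_hi=2e5)
print(f"B_min ≈ {B_min:,.0f} tokens/step, B_opt ≈ {B_opt:,.0f} tokens/step")
```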

Practical Batch Size Scheduling: Dynamic Expansion

Given the non-optimality of fixed batch sizes in the WSD regime, a dynamic batch size scheduling algorithm is derived that progressively expands the batch size, informed by the empirical curves of B_opt relative to cumulative data consumption. This strategy is designed to maximize data efficiency and reach deeper convergence.
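The concrete scheduling rule in the paper is derived from these fitted B_opt curves; as a rough illustration of the mechanics only, the sketch below maps cumulative tokens consumed to a target batch size through a placeholder monotone curve and quantizes it into hardware-friendly step-ups. In practice the mapping would come from the fitted B_opt curve for the model at hand.

```python
# Minimal sketch of a dynamic batch size scheduler (not the authors' implementation).
# b_opt_of_tokens stands in for a fitted B_opt-vs-data-consumed curve; the power law
# below is purely a placeholder assumption.

def b_opt_of_tokens(tokens_seen: float) -> float:
    return 2.0e5 * (1.0 + tokens_seen / 1.0e11) ** 0.5   # target batch in tokens/step

def scheduled_batch_size(tokens_seen: float, base: int = 2 ** 15,
                         step: int = 2 ** 15, max_batch: int = 2 ** 22) -> int:
    """Round the target batch to a multiple of `step` tokens, clamped to [base, max_batch]."""
    target = b_opt_of_tokens(tokens_seen)
    quantized = step * max(1, round(target / step))
    return int(min(max(quantized, base), max_batch))

# Example: how the global batch (in tokens) steps up across a long stable phase.
for tokens_seen in [0, 2.5e11, 5e11, 1e12, 2e12]:
    print(f"{tokens_seen / 1e9:6.0f}B tokens seen -> batch {scheduled_batch_size(tokens_seen):,} tokens")
```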

Empirical studies with Qwen3-Dense and Qwen3-MoE architectures demonstrate that dynamic batch size schedules outperform fixed-batch baselines in training efficacy and downstream task results. In both cases, under constant learning rate regimes, the dynamic strategy yields smoother, accelerated loss convergence and higher MMLU/CMMLU scores (Figures 4-7).

Figure 4: Training loss trajectories under fixed vs. dynamic batch size scheduling for Qwen3 MoE model at constant learning rate.

Figure 5: Downstream benchmark results for Qwen3 MoE, confirming sustained superiority of dynamic scheduling.

Figure 6: Training loss curves for Qwen3 Dense comparing fixed with dynamic batch schedule approaches.

Figure 7: Comparative evaluation for Qwen3 Dense on downstream tasks, dynamic scheduling maintains higher scores.

Ablations and Robustness

Comprehensive ablations confirm the generality and adaptability of the dynamic batch size scheduler across:

  • Cosine LR Schedules: Dynamic batch size remains advantageous (Figure 8).
  • Learning Rate Scaling: Synchronous LR scaling with batch size offers no additional gains and may dilute gradient noise suppression (Figure 9).
  • Sequence Length Scaling vs. Micro-batch Expansion: Sequence length change introduces undesirable distribution shift and adaptation delays (Figure 10).
  • Weight Decay: The effectiveness of dynamic batch sizing depends critically on sufficient regularization strength (Figure 11).
  • Continued Training/Annealing: The dynamic scheme sustains its advantage into the annealing phase with decayed learning rates (Figure 12).

Figure 8: Dynamic batch size scheduling remains beneficial with cosine learning rate schedules.

Figure 9: No substantive improvement from learning rate scaling alongside batch size expansion.

Figure 10: Sequence length scaling causes performance drops not observed with micro-batch scaling.

Figure 11: Reduced weight decay diminishes the advantage of dynamic batch size scheduling.

Figure 12: Dynamic scheduling's advantage persists through the learning rate annealing phase.

Theoretical and Practical Implications

The findings formally invalidate classic Critical Batch Size theory for current large-scale training practice employing WSD LR schedules. Crucially, the efficiency-optimal batch size is not static but monotonically increases as training progresses, motivating dynamic batch size schedules. This realization has direct consequences for practical LLM pre-training: fixed batch size configurations are suboptimal, and dynamic expansion should be implemented to maximize both efficiency and model quality.

Theoretically, the suite of analytical results, including the tripartite E(S) structure and its associated constraints, provides a basis for further predictive modeling of training dynamics and hyperparameter scaling, advancing principled approaches to large-scale model optimization.

Future Directions

The paper flags notable open challenges, including generalization of E(S) fitting across LR schedules, formal proof of global optimality for dynamic scheduling, and mitigation of distributional shifts associated with sequence length scaling. Addressing these will enable broader adoption and flexibility in batch scheduling paradigms.

Conclusion

This study robustly establishes the breakdown of previous batch size theory under WSD learning rate scheduling and provides a new theoretical and empirical basis for dynamic batch size scheduling in large-scale pre-training. The dynamic strategy yields tangible gains in efficiency and downstream model quality and should be incorporated into future generation foundation model pipelines. The investigative framework also opens new avenues for hyperparameter scaling law research as model scale and training corpus size continue to grow.

Explain it Like I'm 14

Overview

This paper looks at how to choose the “batch size” when training very large language models (LLMs). Batch size is how many examples the model studies at once before it updates itself. The authors show that a popular old rule for picking batch size no longer works with the way people train modern models. They create a new way to think about batch size and propose a strategy to adjust it during training to make learning faster and better.

Key Questions

The paper asks simple but important questions:

  • How should we set batch size when using the common Warmup–Stable–Decay (WSD) learning rate schedule (start slow, train steadily, then slow down)?
  • Is there a minimum batch size needed to reach a certain quality (low loss)?
  • Is there a “best” batch size that uses the least data to reach a chosen quality?
  • Can we improve training by changing (not fixing) batch size over time?

Methods and Approach

Think of training an LLM like practicing for a big exam:

  • Batch size is how many practice questions you do before checking your progress.
  • Learning rate is how big a change you make to your study plan after each check.
  • The WSD schedule is like: warm up (start gently), stable (keep a steady pace), decay (slowly ease off).

Earlier work (by OpenAI) used a formula to trade off “data used” (E, like total tokens read) against “steps taken” (S, like number of updates) to reach a target score (low loss). That old formula worked well with a different learning rate schedule but not with WSD’s long, steady middle phase.
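For a rough sense of scale (numbers purely illustrative): if each update uses a batch of 1 million tokens and training runs for S = 100,000 steps, the data consumed is E = 1 million × 100,000 = 100 billion tokens. The old formula says that, for a fixed target quality, you can trade these off: taking more steps with smaller batches needs fewer total tokens, but never fewer than some floor E_min, and no plan can finish in fewer than S_min steps.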

What the authors did:

  • They measured how much data is needed to reach certain loss levels under WSD and noticed training curves for different batch sizes can cross. In simple terms: sometimes a bigger batch actually needs less total data to reach a deeper (lower) loss, which breaks the old rule.
  • They built a new piecewise model of E(S) (data vs steps) tailored for WSD’s stable phase:
    • At first, using more steps can sharply reduce data needed.
    • In the middle, there’s a “sweet spot” where the trade-off is curved (like a valley).
    • Later, data needed grows more linearly with steps.
  • They fit this new curve to real training results using a robust method (a fitting technique that isn’t thrown off by noisy points).
  • From this curve, they defined two simple, useful batch size measures:
    • B_min: the smallest batch size you must use to reach a chosen loss (quality).
    • B_opt: the batch size that reaches that loss using the least total data.

Finally, they created a practical “Batch Size Scheduler”: instead of picking one batch size for the whole run, gradually increase batch size as training progresses (especially through the stable phase), guided by the curve of B_opt versus how much data you’ve already used.

Main Findings and Why They Matter

  • The old “Critical Batch Size” idea doesn’t describe modern training under WSD. In the stable phase, batch size behavior changes: curves cross, and bigger batches can become more data-efficient at lower losses.
  • The new E(S) curve fits real training data well and leads to clear, interpretable batch size rules:
    • B_min is the minimum you need to even reach your target quality.
    • B_opt is the best batch size for using the least data.
  • Both B_min and B_opt go up as you aim for lower loss (better quality). Translation: as the model gets better, it benefits from larger batch sizes.
  • A dynamic batch size schedule (start smaller, then step up the batch size during the stable phase) consistently beat a fixed batch size in experiments:
    • It lowered training loss faster.
    • It improved scores on benchmarks like MMLU and CMMLU across different model types (dense and MoE).
  • Extra checks (ablations) showed:
    • The strategy also helps with cosine learning rates.
    • Increasing learning rate alongside batch size (a common trick) didn’t help here and could hurt, because it adds noise.
    • Increasing sequence length to simulate bigger batches can cause temporary performance drops due to distribution shifts (the model starts seeing longer texts and needs time to adapt).
    • Regularization strength (weight decay) matters: the scheduler works best with a solid weight decay setting.
    • The benefits persist into the decay phase (later part of training) too.

Implications and Impact

This work gives both a new theory and a practical recipe for modern LLM training:

  • Theory: It replaces the old, now-misleading batch size rule with a WSD-friendly framework that explains what’s really happening during the long, steady training phase.
  • Practice: It shows that gradually increasing batch size over training can save data, speed up convergence, and yield better models.

In simple terms, it helps teams train big models more efficiently and reach higher quality without wasting compute. Over time, this could lower training costs, make strong models more accessible, and improve performance across many AI applications.

Knowledge Gaps, Limitations, and Open Questions

The paper provides a theoretical framework for setting batch sizes during large-scale pre-training with a Warmup-Stable-Decay (WSD) learning rate scheduler. However, there are several areas that require further exploration:

  • Generalizability across Learning Rates: The E(S) curve fitting was conducted for a specific learning rate of 6 × 10^-4. It remains unclear how this relationship would change under different learning rates.
  • Theoretical Proof for Dynamic Strategy: While the dynamic batch size strategy shows empirical success, the paper does not provide a formal theoretical proof supporting the approach.
  • Impact of Gradient Noise: The investigation of gradient noise effects on learning dynamics under the WSD schedule is mentioned but not deeply explored, specifically the influence of Σ/B on model training dynamics (a standard recipe for estimating this quantity is sketched after this list).
  • Adaptive Strategies for Sequence Length Changes: The paper identifies performance changes associated with increased sequence lengths but does not propose methods for mitigating adaptation issues when extending sequence length.
  • Weight Decay Influence: The empirical investigation of weight decay reveals significant interactions with batch size dynamics, yet the underlying mechanisms remain unexplored. How weight decay settings influence the scheduling strategy warrants further study.
  • Continued Training Under Decay: The robustness of the proposed strategy during the decay phase is demonstrated, but more insights into how batch size scheduling interacts with high-quality data annealing are needed.
  • Compatibility with Other Optimizers: The analysis focuses solely on the Adam optimizer. Exploring how other optimizers might perform under a similar batch size scheduling strategy remains an open question.
  • Scalability Across Model Architectures: While validated on specific architectures, such as InternLM2 and Qwen3, the approach's applicability to other model architectures (e.g., Transformers with different attention mechanisms) is not examined.
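As background for the Σ/B item above: the gradient noise scale from the original Critical Batch Size work is commonly estimated by comparing gradient norms measured at two different batch sizes. The PyTorch-style sketch below follows that standard recipe (McCandlish et al., 2018); it is context, not code from this paper, and the function names are illustrative.

```python
# Background sketch (McCandlish et al., 2018), not from this paper: estimate the
# gradient noise scale tr(Sigma)/|G|^2 from gradient norms at two batch sizes.
import torch

def grad_sq_norm(model, loss):
    """Squared L2 norm of the gradient of `loss` w.r.t. the model's parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return float(sum(g.pow(2).sum() for g in grads))

def simple_noise_scale(model, loss_small, b_small, loss_big, b_big):
    """Unbiased estimates of |G|^2 and tr(Sigma) from two batches of size b_small and
    b_big; their ratio is the 'simple' noise scale, and tr(Sigma)/B is the per-step
    gradient noise term that larger batches suppress."""
    g2_small = grad_sq_norm(model, loss_small)   # loss from a batch of b_small examples
    g2_big   = grad_sq_norm(model, loss_big)     # loss from a batch of b_big examples
    G2   = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
    tr_S = (g2_small - g2_big) / (1.0 / b_small - 1.0 / b_big)
    return tr_S / G2
```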

These gaps provide a substantial opportunity for future research to enhance the understanding and practical application of dynamic batch sizing in LLM pre-training.
