
Compute-Optimal Dataset Sizes

Updated 17 February 2026
  • Compute-optimal dataset sizes are defined as the principled allocation of training data that maximizes model quality under a fixed computational budget by balancing model size and data volume.
  • Empirical scaling laws such as the Chinchilla law show that maintaining a constant tokens-per-parameter ratio (around 20) is key to efficient model performance across diverse tasks.
  • The framework informs practical strategies in data curation and finetuning, highlighting trade-offs between overfitting with too few data and diminishing returns when scaling model size without proportional data increases.

Compute-optimal dataset sizes refer to the principled determination of training data volume that, jointly with model size, maximizes model quality for a fixed computational budget. These allocation laws, along with their associated workflow and practical recipes, are foundational for efficient large-scale training of language and protein models. They have recently been extensively revised, unified, and critiqued, with both empirical and information-theoretic approaches now in close agreement across modalities and tasks.

1. Basic Principles: The Compute-Budget Constrained Frontier

At the core of modern pretraining is the constraint that total computational cost (typically measured in floating-point operations, FLOPs) is, to first order, proportional to the product of model size (number of parameters, $N$) and dataset size (number of tokens or examples, $D$):

$C \propto N \times D$

Given a compute budget $C$, one seeks the pair $(N^*, D^*)$ minimizing the final loss (proxies: bits-per-character, negative log-likelihood) subject to $N D = C$. In classic scaling laws, one models loss surfaces as additive power laws:

$L(N, D) = A N^{-\alpha} + B D^{-\beta}$

and derives the efficient frontier by balancing the marginal benefits of parameter and data scaling. Optimizing under fixed $C$ yields closed-form allocations for $N^*(C)$ and $D^*(C)$ (Hoffmann et al., 2022, Dey et al., 2023, Yin et al., 2024).
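The closed form follows by substituting $D = C/N$ into the loss and setting the derivative to zero. A minimal numerical sketch, using fit constants in the spirit of Hoffmann et al.'s reported values (the exact constants here are an assumption of this sketch, and $C$ is taken as the product $N \cdot D$):

```python
import numpy as np

# Illustrative power-law fit constants (roughly the magnitudes reported
# by Hoffmann et al., 2022; treat them as assumptions of this sketch).
A, B, alpha, beta = 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Additive power-law loss surface L(N, D) = A*N^-alpha + B*D^-beta."""
    return A * N**-alpha + B * D**-beta

def optimal_allocation(C):
    """Minimize loss(N, C/N): setting dL/dN = 0 under N*D = C gives
    N* = G * C^(beta/(alpha+beta)) with G = (alpha*A / (beta*B))^(1/(alpha+beta))."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_star = G * C ** (beta / (alpha + beta))
    return N_star, C / N_star

C = 1e21
N_star, D_star = optimal_allocation(C)

# Sanity check: the closed form should beat nearby splits of the same budget.
Ns = np.logspace(np.log10(N_star) - 1, np.log10(N_star) + 1, 2001)
grid_best = Ns[np.argmin(loss(Ns, C / Ns))]
assert abs(np.log10(grid_best) - np.log10(N_star)) < 0.01
```

With $\alpha \approx \beta$, the exponent $\beta/(\alpha+\beta)$ is close to $0.5$, which is exactly the square-root scaling of the Chinchilla frontier discussed next.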

2. The Chinchilla Law and Empirical Scaling in LLMs

The most influential empirical scaling law is the Chinchilla law of Hoffmann et al., validated by the Cerebras-GPT and Porian et al. studies (Hoffmann et al., 2022, Dey et al., 2023, Porian et al., 2024):

  • Optimal scaling: Empirically, $N^*(C) \propto C^{0.5}$ and $D^*(C) \propto C^{0.5}$.
  • Constant “tokens-per-parameter” ratio: The optimal frontier is achieved at $D^*/N^* \approx 20$ for the MassiveText and Pile datasets.
  • Explicit rule:

$N^* = \sqrt{C/r}, \quad D^* = r N^*$

where $r \approx 20$.
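The explicit rule can be sketched directly, treating $C$ as the token–parameter product $N \cdot D$ per the first-order proportionality above:

```python
import math

def chinchilla_allocation(C, r=20):
    """Fixed-ratio rule: choose N and D with N * D = C and D / N = r."""
    N = math.sqrt(C / r)
    return N, r * N

N, D = chinchilla_allocation(1e21)   # budget of 1e21 "token-parameters"
assert abs(N * D - 1e21) / 1e21 < 1e-12   # budget is exactly spent
assert abs(D / N - 20) < 1e-9             # tokens-per-parameter ratio held
```

Real FLOP accounting multiplies $N \cdot D$ by a constant (commonly 6 for a forward-plus-backward pass), which rescales $C$ but leaves the optimal ratio $D/N$ unchanged.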

The optimal pair (N,D)(N^*, D^*) is robust to dataset type, optimizer, parametrization (μP), and minor implementation choices, provided total FLOPs are precisely counted (including final-layer FLOPs), and batch and learning rate scaling are properly tuned (Porian et al., 2024, Dey et al., 2023).

Table 1: Empirical Chinchilla Law Parameters

| Source | Exponent on $C$ for $N^*$ | Exponent on $C$ for $D^*$ | $D^*/N^*$ (tokens/param) |
|---|---|---|---|
| Hoffmann et al., 2022 | 0.5 | 0.5 | 20 |
| Porian et al., 2024 | 0.498 | 0.498 | 21 |
| Cerebras-GPT, 2023 | 0.5 | 0.5 | 20 |

Under these laws, increasing $N$ without proportional data scaling leads to overfitting and suboptimal loss; scaling $D$ without growing $N$ produces diminishing returns.

3. Unified Scaling Laws and Degeneracy

“More Compute Is What You Need” (Guo, 2024) reports that for transformer LLMs, model performance depends primarily on total computation $C \approx N D$, independent of the specific split between parameters and tokens. The main empirical fit,

$\mathrm{BPC}(N, D) = \alpha \log(N D) + \beta$

with $\alpha = -0.031$, $\beta = 0.572$, and $R^2 \sim 0.95$ (across 20+ models), indicates that all pairs $(N, D)$ with $N D = C$ achieve the same loss. This produces a degenerate optimum: any aspect ratio $\lambda$ in $(N, D) = (\lambda C^{1/2},\; C^{1/2}/\lambda)$ attains identical predicted performance.
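The degeneracy is easy to verify numerically. The sketch below uses the quoted fit constants and assumes a base-10 logarithm (the paper's choice of base is not specified in this summary):

```python
import math

ALPHA, BETA = -0.031, 0.572   # fit constants quoted above

def bpc(N, D):
    """Predicted bits-per-character; depends on N and D only via N * D."""
    return ALPHA * math.log10(N * D) + BETA

C = 1e16
# Split the same budget across several aspect ratios lambda:
preds = [bpc(lam * math.sqrt(C), math.sqrt(C) / lam) for lam in (0.1, 1.0, 10.0)]
assert max(preds) - min(preds) < 1e-12   # identical prediction for every split
```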

Practical implication: Optimal $(N, D)$ allocation should reflect secondary desiderata:

  • Small $N$, large $D$: preferable for inference efficiency and latency.
  • Large $N$, small $D$: required if high-quality data is exhausted.

Under data exhaustion, only further model scaling improves performance (Guo, 2024).

4. Domain-Specific Scaling: Protein LMs and Skill-Dependence

Protein LLMs

Empirical studies in protein language modeling (Serrano et al., 2024, Cheng et al., 2024) show qualitatively different exponents from text LLMs:

  • Sublinear model scaling: $N^*(C) \propto C^{0.27}$, $D^*(C) \propto C^{0.71}$.
  • Plateau regime: For encoder-only pLMs, once a single epoch of unique tokens is seen, further data contributes negligible gains.
  • Loss-scaling for CLM/MLM: Closed forms

$T^*_{\rm CLM}(C, N) \propto C^{0.58} N^{-0.42}, \quad T^*_{\rm MLM}(C, N) \propto C^{0.75} N^{-0.25}$

indicate that for masked LMs optimal data sizes may exceed realistically available unique tokens.
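A small sketch of these fitted forms (the constants of proportionality are omitted here, so outputs are relative scalings rather than absolute token counts):

```python
def optimal_tokens(C, N, objective="CLM"):
    """Relative compute-optimal training tokens for protein LMs, following
    the fitted forms quoted above; proportionality constants are dropped."""
    if objective == "CLM":
        return C**0.58 * N**-0.42
    if objective == "MLM":
        return C**0.75 * N**-0.25
    raise ValueError(f"unknown objective: {objective}")

# At fixed model size, MLM demands relatively more data than CLM,
# and the gap widens as compute grows (exponent 0.75 vs. 0.58 on C):
N = 1e9
ratio_small = optimal_tokens(1e20, N, "MLM") / optimal_tokens(1e20, N, "CLM")
ratio_large = optimal_tokens(1e23, N, "MLM") / optimal_tokens(1e23, N, "CLM")
assert ratio_large > ratio_small
```

This widening gap is why, for masked LMs, the compute-optimal token count can outrun the pool of unique protein tokens actually available.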

Skill-Dependent Scaling Laws

Recent work demonstrates that compute-optimal dataset size is skill-dependent (Roberts et al., 13 Mar 2025). For example, knowledge-based QA and code generation have optimal $D^*(C)$ scaling exponents of 0.61 and 0.66, respectively (from $N^*(C) \propto C^{0.39}$ and $C^{0.34}$). This reflects that code and reasoning evaluation rewards larger datasets relative to model size, while knowledge QA is relatively more capacity-bound.

5. Information-Theoretic and Random Graph Foundations

Probabilistic and information-theoretic analyses (Jeon et al., 2022, Nayak et al., 2024) reproduce the observed scaling laws:

  • For broad neural architectures, minimax bounds on cross-entropy error yield asymptotically linear data-to-parameter scaling: $N^* \propto k\, M^*$, with $k \sim \ln C$ growing slowly in compute.
  • In semantic graph-based formulations, compute-optimal scaling arises from LDPC iterative decoding and matching the critical threshold for coverage in a bipartite “concept–text” network, yielding $N^*, D^* \sim C^{1/2}$ (Nayak et al., 2024).

These theories predict emergent phenomena, such as performance plateaus and the sharp appearance of new skills, arising when giant components (“skills”) percolate in the random graph as $C$ increases.

6. Compute-Optimal Data Curation and Finetuning

When the data collection process involves significant computational cost, as in expensive data selection or synthetic data generation, compute-optimal strategies must account for both selection and training cost (Bansal et al., 2024, Yin et al., 2024):

  • For synthetic data, repeated sampling from a weaker, cheaper generator (lower per-example FLOPs) yields greater coverage/diversity and better downstream scores at fixed budget, provided the false-positive rate is controlled. Switching to a stronger, more expensive generator becomes favorable only when these coverage/diversity benefits are demonstrably saturated.
  • In data selection for finetuning, the optimal dataset size is $N^*_m = C_{\rm total} / (c^{(\rm train)} + c^{(\rm sel)}_m)$, where $c^{(\rm sel)}_m$ is the per-token selection cost of method $m$. Simple methods (BM25, Embed) are almost always compute-optimal at moderate budgets; high-cost selectors (perplexity, gradients) only become worthwhile when the train-to-selector model size ratio is $\gtrsim 5\times$ to $10\times$.
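The selection-cost trade-off follows directly from the formula above; in the sketch below, the per-token costs are hypothetical numbers chosen for illustration only:

```python
def optimal_selected_tokens(C_total, c_train, c_sel):
    """Tokens a selection method can afford: every selected token costs
    its per-token selection cost plus the per-token training cost."""
    return C_total / (c_train + c_sel)

# Hypothetical per-token costs (not from the cited papers):
C_total = 1e18
c_train = 6e9           # ~6 * params FLOPs/token for a 1B-param train model
cheap, expensive = 1e3, 2e9   # lexical selector vs. perplexity-based selector

n_cheap = optimal_selected_tokens(C_total, c_train, cheap)
n_costly = optimal_selected_tokens(C_total, c_train, expensive)
assert n_cheap > n_costly   # the cheap selector affords more training tokens
```

The expensive selector only pays off when its better per-token choices outweigh the tokens it forfeits, which is why large train-to-selector size ratios are needed before high-cost methods win.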

Table 2: Compute-Optimal Data Generation Strategies

| Setting | Optimal Allocation | Condition |
|---|---|---|
| Synthetic data (weak/strong generator) | All budget to weaker, cheaper generator (WC) | Coverage/diversity of WC $\gg$ SE at rescaled $k$ |
| Data selection (finetuning) | Max tokens with lowest-cost selector | High-cost methods only optimal at scale |

7. Practical Recommendations and Caveats

  • For LLMs: The fixed-ratio Chinchilla law ($D/N \approx 20$) holds robustly for pretraining, provided compute is accurately accounted and hyperparameters are tailored per scale (Dey et al., 2023, Porian et al., 2024).
  • For specialized domains (e.g., protein LMs): Exponents may differ; do not blindly import NLP scaling laws, but measure domain-specific loss surfaces and account for loss plateaus and data limits (Serrano et al., 2024, Cheng et al., 2024).
  • Skill composition: Compute-optimal dataset/parameter split is task-dependent. When optimizing for multiple skills, composite validation sets must represent real-world priorities; otherwise, the optimum may shift by up to $50\%$ in parameter count due to validation-set misspecification (Roberts et al., 13 Mar 2025).
  • Data quality and single-epoch limits: In regimes where unique data is exhausted, further gains require larger NN and increased compute, with little benefit from repeating tokens.
  • Extrapolation: Scaling laws are empirical and generally validated up to scales of $C \sim 10^{23}$ FLOPs; extrapolation beyond this, or to new domains/architectures, should be approached with caution.

