
Compute-Optimal Dataset Sizes

Updated 17 February 2026
  • Compute-optimal dataset sizes are defined as the principled allocation of training data that maximizes model quality under a fixed computational budget by balancing model size and data volume.
  • Empirical scaling laws such as the Chinchilla law show that maintaining a constant tokens-per-parameter ratio (around 20) is key to efficient model performance across diverse tasks.
  • The framework informs practical strategies in data curation and finetuning, highlighting trade-offs between overfitting with too few data and diminishing returns when scaling model size without proportional data increases.

Compute-optimal dataset sizes refer to the principled determination of training data volume that, jointly with model size, maximizes model quality for a fixed computational budget. These allocation laws, along with their associated workflow and practical recipes, are foundational for efficient large-scale training of language and protein models. They have recently been extensively revised, unified, and critiqued, with both empirical and information-theoretic approaches now in close agreement across modalities and tasks.

1. Basic Principles: The Compute-Budget Constrained Frontier

At the core of modern pretraining is the constraint that total computational cost (typically measured in floating-point operations, FLOPs) is, to first order, proportional to the product of model size (number of parameters, $N$) and dataset size (number of tokens or examples, $D$):

$C \propto N \times D$

Given a compute budget $C$, one seeks the pair $(N^*, D^*)$ minimizing the final loss (proxies: bits-per-character, negative log-likelihood) subject to $N D = C$. In classic scaling laws, one models loss surfaces as additive power laws:

$L(N, D) = A N^{-\alpha} + B D^{-\beta}$

and derives the efficient frontier by balancing the marginal benefits of parameter and data scaling. Optimizing under fixed $C$ yields closed-form allocations for $N^*(C)$ and $D^*(C)$ (Hoffmann et al., 2022, Dey et al., 2023, Yin et al., 2024).
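The closed form follows by substituting $D = C/N$ into the loss and setting the derivative to zero. A minimal numerical sketch, using fit constants in the spirit of Hoffmann et al.'s reported values (the exact constants here are an assumption of this sketch, and $C$ is taken as the product $N \cdot D$):

```python
import numpy as np

# Illustrative power-law fit constants (roughly the magnitudes reported
# by Hoffmann et al., 2022; treat them as assumptions of this sketch).
A, B, alpha, beta = 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Additive power-law loss surface L(N, D) = A*N^-alpha + B*D^-beta."""
    return A * N**-alpha + B * D**-beta

def optimal_allocation(C):
    """Minimize loss(N, C/N): setting dL/dN = 0 under N*D = C gives
    N* = G * C^(beta/(alpha+beta)) with G = (alpha*A / (beta*B))^(1/(alpha+beta))."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_star = G * C ** (beta / (alpha + beta))
    return N_star, C / N_star

C = 1e21
N_star, D_star = optimal_allocation(C)

# Sanity check: the closed form should beat nearby splits of the same budget.
Ns = np.logspace(np.log10(N_star) - 1, np.log10(N_star) + 1, 2001)
grid_best = Ns[np.argmin(loss(Ns, C / Ns))]
assert abs(np.log10(grid_best) - np.log10(N_star)) < 0.01
```

With $\alpha \approx \beta$, the exponent $\beta/(\alpha+\beta)$ is close to $0.5$, which is exactly the square-root scaling of the Chinchilla frontier discussed next.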

2. The Chinchilla Law and Empirical Scaling in LLMs

The most influential empirical scaling law is the Chinchilla law of Hoffmann et al., validated by the Cerebras-GPT and Porian et al. studies (Hoffmann et al., 2022, Dey et al., 2023, Porian et al., 2024):

  • Optimal scaling: Empirically, $N^*(C) \propto C^{0.5}$ and $D^*(C) \propto C^{0.5}$.
  • Constant “tokens-per-parameter” ratio: The optimal frontier is achieved at $D^*/N^* \approx 20$ for the MassiveText and Pile datasets.
  • Explicit rule:

$N^* = \sqrt{C/r}, \quad D^* = r N^*$

where $r \approx 20$.
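The explicit rule can be sketched directly, treating $C$ as the token–parameter product $N \cdot D$ per the first-order proportionality above:

```python
import math

def chinchilla_allocation(C, r=20):
    """Fixed-ratio rule: choose N and D with N * D = C and D / N = r."""
    N = math.sqrt(C / r)
    return N, r * N

N, D = chinchilla_allocation(1e21)   # budget of 1e21 "token-parameters"
assert abs(N * D - 1e21) / 1e21 < 1e-12   # budget is exactly spent
assert abs(D / N - 20) < 1e-9             # tokens-per-parameter ratio held
```

Real FLOP accounting multiplies $N \cdot D$ by a constant (commonly 6 for a forward-plus-backward pass), which rescales $C$ but leaves the optimal ratio $D/N$ unchanged.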

The optimal pair (N,D)(N^*, D^*) is robust to dataset type, optimizer, parametrization (μP), and minor implementation choices, provided total FLOPs are precisely counted (including final-layer FLOPs), and batch and learning rate scaling are properly tuned (Porian et al., 2024, Dey et al., 2023).

Table 1: Empirical Chinchilla Law Parameters

| Source | Exponent on $C$ for $N^*$ | Exponent on $C$ for $D^*$ | $D^*/N^*$ (tokens/param) |
|---|---|---|---|
| Hoffmann et al., 2022 | 0.5 | 0.5 | 20 |
| Porian et al., 2024 | 0.498 | 0.498 | 21 |
| Cerebras-GPT, 2023 | 0.5 | 0.5 | 20 |

Under these laws, increasing $N$ without proportional data scaling leads to overfitting and suboptimal loss; scaling $D$ without growing $N$ produces diminishing returns.

3. Unified Scaling Laws and Degeneracy

“More Compute Is What You Need” (Guo, 2024) reports that for transformer LLMs, model performance depends primarily on total computation $C \approx N D$, independent of the specific split between parameters and tokens. The main empirical fit,

$\mathrm{BPC}(N, D) = \alpha \log(N D) + \beta$

with $\alpha = -0.031$, $\beta = 0.572$, and $R^2 \sim 0.95$ (across 20+ models), indicates that all pairs $(N, D)$ with $N D = C$ achieve the same loss. This produces a degenerate optimum: any aspect ratio $\lambda$ in $(N, D) = (\lambda C^{1/2},\; C^{1/2}/\lambda)$ attains identical predicted performance.
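The degeneracy is easy to verify numerically. The sketch below uses the quoted fit constants and assumes a base-10 logarithm (the paper's choice of base is not specified in this summary):

```python
import math

ALPHA, BETA = -0.031, 0.572   # fit constants quoted above

def bpc(N, D):
    """Predicted bits-per-character; depends on N and D only via N * D."""
    return ALPHA * math.log10(N * D) + BETA

C = 1e16
# Split the same budget across several aspect ratios lambda:
preds = [bpc(lam * math.sqrt(C), math.sqrt(C) / lam) for lam in (0.1, 1.0, 10.0)]
assert max(preds) - min(preds) < 1e-12   # identical prediction for every split
```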

Practical implication: Optimal $(N, D)$ allocation should reflect secondary desiderata:

  • Small $N$, large $D$: preferable for inference efficiency and latency.
  • Large $N$, small $D$: required if high-quality data is exhausted.

Under data exhaustion, only further model scaling improves performance (Guo, 2024).

4. Domain-Specific Scaling: Protein LMs and Skill-Dependence

Protein LLMs

Empirical studies in protein language modeling (Serrano et al., 2024, Cheng et al., 2024) show qualitatively different exponents from text LLMs:

  • Sublinear model scaling: $N^*(C) \propto C^{0.27}$, $D^*(C) \propto C^{0.71}$.
  • Plateau regime: For encoder-only pLMs, once a single epoch of unique tokens is seen, further data contributes negligible gains.
  • Loss-scaling for CLM/MLM: Closed forms

$T^*_{\rm CLM}(C, N) \propto C^{0.58} N^{-0.42}, \quad T^*_{\rm MLM}(C, N) \propto C^{0.75} N^{-0.25}$

indicate that for masked LMs optimal data sizes may exceed realistically available unique tokens.
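A small sketch of these fitted forms (the constants of proportionality are omitted here, so outputs are relative scalings rather than absolute token counts):

```python
def optimal_tokens(C, N, objective="CLM"):
    """Relative compute-optimal training tokens for protein LMs, following
    the fitted forms quoted above; proportionality constants are dropped."""
    if objective == "CLM":
        return C**0.58 * N**-0.42
    if objective == "MLM":
        return C**0.75 * N**-0.25
    raise ValueError(f"unknown objective: {objective}")

# At fixed model size, MLM demands relatively more data than CLM,
# and the gap widens as compute grows (exponent 0.75 vs. 0.58 on C):
N = 1e9
ratio_small = optimal_tokens(1e20, N, "MLM") / optimal_tokens(1e20, N, "CLM")
ratio_large = optimal_tokens(1e23, N, "MLM") / optimal_tokens(1e23, N, "CLM")
assert ratio_large > ratio_small
```

This widening gap is why, for masked LMs, the compute-optimal token count can outrun the pool of unique protein tokens actually available.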

Skill-Dependent Scaling Laws

Recent work demonstrates that compute-optimal dataset size is skill-dependent (Roberts et al., 13 Mar 2025). For example, knowledge-based QA and code generation have optimal $D^*(C)$ scaling exponents of 0.61 and 0.66, respectively (from $N^*(C) \propto C^{0.39}$ and $C^{0.34}$). This reflects that code and reasoning evaluation rewards larger datasets relative to model size, while knowledge QA is relatively more capacity-bound.

5. Information-Theoretic and Random Graph Foundations

Probabilistic and information-theoretic analyses (Jeon et al., 2022, Nayak et al., 2024) reproduce the observed scaling laws:

  • For broad neural architectures, minimax bounds on cross-entropy error yield asymptotically linear data-to-parameter scaling: $N^* \propto k\, M^*$, with $k \sim \ln C$ growing slowly in compute.
  • In semantic graph-based formulations, compute-optimal scaling arises from LDPC iterative decoding and matching the critical threshold for coverage in a bipartite “concept–text” network, yielding $N^*, D^* \sim C^{1/2}$ (Nayak et al., 2024).

These theories predict emergent phenomena, such as performance plateaus and the sharp appearance of new skills, arising when giant components (“skills”) percolate in the random graph as $C$ increases.

6. Compute-Optimal Data Curation and Finetuning

When the data collection process involves significant computational cost, as in expensive data selection or synthetic data generation, compute-optimal strategies must account for both selection and training cost (Bansal et al., 2024, Yin et al., 2024):

  • For synthetic data, repeated sampling from a weaker, cheaper generator (lower per-example FLOPs) yields greater coverage/diversity and better downstream scores at fixed budget, provided the false-positive rate is controlled. Switching to a stronger, more expensive generator becomes favorable only when these coverage/diversity benefits are demonstrably saturated.
  • In data selection for finetuning, the optimal dataset size is $N^*_m = C_{\rm total} / (c^{(\rm train)} + c^{(\rm sel)}_m)$, where $c^{(\rm sel)}_m$ is the per-token selection cost of method $m$. Simple methods (BM25, Embed) are almost always compute-optimal at moderate budgets; high-cost selectors (perplexity, gradients) only become worthwhile when the train-to-selector model size ratio is $\gtrsim 5\times$ to $10\times$.
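The selection-cost trade-off follows directly from the formula above; in the sketch below, the per-token costs are hypothetical numbers chosen for illustration only:

```python
def optimal_selected_tokens(C_total, c_train, c_sel):
    """Tokens a selection method can afford: every selected token costs
    its per-token selection cost plus the per-token training cost."""
    return C_total / (c_train + c_sel)

# Hypothetical per-token costs (not from the cited papers):
C_total = 1e18
c_train = 6e9           # ~6 * params FLOPs/token for a 1B-param train model
cheap, expensive = 1e3, 2e9   # lexical selector vs. perplexity-based selector

n_cheap = optimal_selected_tokens(C_total, c_train, cheap)
n_costly = optimal_selected_tokens(C_total, c_train, expensive)
assert n_cheap > n_costly   # the cheap selector affords more training tokens
```

The expensive selector only pays off when its better per-token choices outweigh the tokens it forfeits, which is why large train-to-selector size ratios are needed before high-cost methods win.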

Table 2: Compute-Optimal Data Generation Strategies

| Setting | Optimal Allocation | Condition |
|---|---|---|
| Synthetic data (weak/strong generator) | All budget to weaker, cheaper generator (WC) | Coverage/diversity of WC $\gg$ SE at rescaled $k$ |
| Data selection (finetuning) | Max tokens with lowest-cost selector | High-cost methods only optimal at scale |

7. Practical Recommendations and Caveats

  • For LLMs: The fixed-ratio Chinchilla law ($D/N \approx 20$) holds robustly for pretraining, provided compute is accurately accounted and hyperparameters are tailored per scale (Dey et al., 2023, Porian et al., 2024).
  • For specialized domains (e.g., protein LMs): Exponents may differ; do not blindly import NLP scaling laws, but measure domain-specific loss surfaces and account for loss plateaus and data limits (Serrano et al., 2024, Cheng et al., 2024).
  • Skill composition: Compute-optimal dataset/parameter split is task-dependent. When optimizing for multiple skills, composite validation sets must represent real-world priorities; otherwise, the optimum may shift by up to $50\%$ in parameter count due to validation-set misspecification (Roberts et al., 13 Mar 2025).
  • Data quality and single-epoch limits: In regimes where unique data is exhausted, further gains require larger NN and increased compute, with little benefit from repeating tokens.
  • Extrapolation: Scaling laws are empirical and generally validated up to scales of $C \sim 10^{23}$ FLOPs; extrapolation beyond this, or to new domains/architectures, should be approached with caution.

