Kaplan & Chinchilla Scaling Laws
- Kaplan and Chinchilla Scaling Laws are empirical and theoretical relationships that predict how test loss in neural language models scales with parameter count, dataset size, and compute cost.
- They provide power-law formulas for optimal resource allocation, guiding the balance between increasing model size and training data to achieve efficient performance improvements.
- Extensions incorporate architecture-specific factors like latency and sparsity, enabling principled LLM design and deployment under practical constraints.
Kaplan and Chinchilla Scaling Laws are a class of empirical and theoretical relationships that characterize how neural LLMs' test loss scales with respect to model parameter count, dataset size, compute cost, and, in modern extensions, architecture-dependent inference characteristics. These scaling laws provide quantitative recipes for allocating compute between model capacity and data, predict training behavior across orders of magnitude in scale, and now underpin principled design of LLM architectures under deployment constraints.
1. Origins and Mathematical Forms of Scaling Laws
The original scaling law formulation by Kaplan et al. (2020) established that trained transformer models exhibit predictable decreases in test cross-entropy loss as a power law in both non-embedding parameter count $N$ and number of unique training tokens $D$:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D},$$

where $N_c$, $D_c$, $\alpha_N \approx 0.076$, and $\alpha_D \approx 0.095$ are empirical constants determined by fitting for their original GPT-style models. Compute-optimal model size under the constraint $C \approx 6ND$ was derived as $N_{\mathrm{opt}} \propto C^{0.73}$, $D_{\mathrm{opt}} \propto C^{0.27}$, indicating an allocation favoring larger models.
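This functional form and the compute-optimal split can be sketched in a few lines of Python. The constants below are the published Kaplan et al. fits; the absolute prefactors in the allocation are a free choice of this sketch, not values from the paper:

```python
# Sketch of the Kaplan et al. (2020) joint scaling law and its
# compute-optimal allocation. Constants are the published fits;
# treat them as illustrative, not authoritative for new setups.

ALPHA_N, ALPHA_D = 0.076, 0.095   # parameter / data exponents
N_C, D_C = 8.8e13, 5.4e13         # fitted scale constants (non-embedding params, tokens)

def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def kaplan_optimal_allocation(compute: float) -> tuple[float, float]:
    """Split a FLOP budget C ~= 6*N*D using N ~ C^0.73, D ~ C^0.27.

    The prefactors here are chosen so that 6*N*D == C exactly; the
    absolute scale of each factor is an assumption of this sketch.
    """
    n_opt = (compute / 6.0) ** 0.73
    d_opt = (compute / 6.0) ** 0.27
    return n_opt, d_opt

n, d = kaplan_optimal_allocation(1e21)
assert abs(6 * n * d - 1e21) / 1e21 < 1e-6   # budget is respected
print(f"N={n:.3e}, D={d:.3e}, loss={kaplan_loss(n, d):.3f}")
```

Note how the skewed exponents ($0.73$ vs. $0.27$) push almost the entire budget into parameters as $C$ grows, which is exactly the allocation that Chinchilla later revised.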
Hoffmann et al. (2022) ("Chinchilla") revised these exponents using an expanded parameter and data regime, and by accounting for total (including embedding) parameter count and compute:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}.$$

Their fitting yields $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$, and, critically, a compute-optimal allocation $N_{\mathrm{opt}} \propto C^{0.5}$, $D_{\mathrm{opt}} \propto C^{0.5}$, implying a balanced "50:50" split between increasing model and data for fixed compute (Pearce et al., 2024, Porian et al., 2024).
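The Chinchilla form and its balanced allocation can likewise be sketched directly. The loss constants are the published Hoffmann et al. fits; the 20-tokens-per-parameter prefactor is the commonly quoted Chinchilla rule of thumb, assumed here for illustration:

```python
# Chinchilla parametric loss (Hoffmann et al., 2022) with the published
# fit, and the balanced compute-optimal split N_opt, D_opt ~ C^0.5.

E, A, B = 1.69, 406.4, 410.7      # irreducible loss and prefactors
ALPHA, BETA = 0.34, 0.28          # parameter and data exponents

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def chinchilla_optimal_allocation(compute: float) -> tuple[float, float]:
    """Under C ~= 6*N*D with N_opt, D_opt ~ C^0.5, both scale equally.

    The prefactor here enforces D ~= 20*N, the commonly quoted
    Chinchilla tokens-per-parameter ratio (an assumption of this sketch).
    """
    n_opt = (compute / (6.0 * 20.0)) ** 0.5
    return n_opt, 20.0 * n_opt

# Chinchilla itself: ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_optimal_allocation(6 * 70e9 * 1.4e12)
print(f"N={n:.2e}, D={d:.2e}")   # recovers 70B params / 1.4T tokens
```

With equal exponents on $N$ and $D$, doubling compute multiplies both the optimal model size and the optimal token count by $\sqrt{2}$.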
Empirical tests and analytic reparametrizations, correcting for embedding-layer and small-scale biases, demonstrate that Kaplan's originally steeper exponent is a finite-scale artifact of the non-embedding parameter convention. Simulating Chinchilla under the original Kaplan conventions replicates the $0.73$ exponent, with both analyses converging to $0.5$ at large scale (Pearce et al., 2024, Porian et al., 2024).
2. Theoretical Explanations and Unification
Contemporary work provides rigorous theoretical underpinnings for these empirical power laws:
- An information-theoretic analysis establishes, in Barron-like single-hidden-layer settings, tight upper bounds whose compute-optimal minimizer fulfills $D^* \propto N^*$ (i.e., linear data-parameter scaling), matching Chinchilla's empirical law up to logarithmic corrections (Jeon et al., 2022).
- The "Effective Frontier" framework abstracts task learning as progressive coverage of patterns from a long-tailed (Zipf) distribution. Loss is attributed to the mass of unlearned tail patterns above a cutoff (the "Effective Frontier"), which depends on the available resource.
Distinct scaling exponents emerge for parameters $N$, data $D$, and training steps $S$. The key result is a Max-Bottleneck principle: when multiple resource constraints apply, loss is dictated by the slowest-decaying term,
$$L \approx \max\left(c_N N^{-\alpha_N},\ c_D D^{-\alpha_D},\ c_S S^{-\alpha_S}\right).$$
Constrained optimization over $(N, D)$ with $C \approx 6ND$ yields the two limiting regimes: the Kaplan law (compute-limited, $N_{\mathrm{opt}} \propto C^{0.73}$) and the Chinchilla law (data-limited, $N_{\mathrm{opt}} \propto C^{0.5}$), reconciling them as equilibrium solutions of a unified scaling equation (Zou et al., 1 Feb 2026).
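The Max-Bottleneck behavior can be demonstrated with a minimal sketch; the exponents and prefactors below are illustrative placeholders, not fitted values from the Effective Frontier paper:

```python
# Sketch of the Max-Bottleneck principle: with several resources in play,
# loss is governed by the slowest-decaying (largest) power-law term.
# Exponents and coefficients are illustrative placeholders only.

def bottleneck_loss(n: float, d: float, s: float,
                    alphas=(0.34, 0.28, 0.30),
                    coeffs=(406.4, 410.7, 100.0)) -> float:
    terms = [c / r**a for r, a, c in zip((n, d, s), alphas, coeffs)]
    return max(terms)  # only the binding constraint matters

# Data-limited regime: growing N past the bottleneck stops helping.
small_data = bottleneck_loss(1e9, 1e8, 1e6)
bigger_model = bottleneck_loss(1e12, 1e8, 1e6)
assert small_data == bigger_model  # loss pinned by the data/step terms
```

Raising whichever resource term currently dominates is the only way to lower the loss, which is precisely what drives the regime switch between the Kaplan and Chinchilla allocations.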
3. Practical Fitting, Methodological Refinements, and Calibration
Subsequent analyses identified methodological artifacts impacting exponent estimates and fitting consistency:
- Exclusion of embedding and output layers from parameter and compute counts produced a $0.73$ exponent at small scale; total parameter and compute accounting resolves this to Chinchilla's $0.50$ (Pearce et al., 2024).
- Warmup duration, last-layer computational cost, and optimizer hyperparameter scaling systematically skew power-law estimates. Incorporating per-run, size-adaptive warmup, counting all FLOPs (including head and embedding), and per-model optimizer tuning yields scaling curves with $N_{\mathrm{opt}} \propto C^{0.5}$, matching Chinchilla (Porian et al., 2024).
- Out-of-sample accuracy of parametric scaling forms improves with robust Huber-loss objectives fitted via L-BFGS. Nonparametric ML regression (e.g., a kernel or neural-network surface for $L(N, D)$) further improves frontier estimation in some empirical settings (Barkeshli et al., 15 Jan 2026).
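A toy illustration of Huber-loss fitting of a one-dimensional power law with an irreducible offset, in the spirit of the Hoffmann et al. procedure (which applies a Huber loss to log-space residuals). A pure-stdlib grid search stands in for L-BFGS here, and all search ranges are assumptions of this sketch:

```python
# Robust fit of L = E + A * N^(-alpha) to (N, L) pairs via a Huber loss
# on log-residuals. Grid search replaces L-BFGS for self-containedness;
# a real fit would use scipy.optimize with method="L-BFGS-B".
import math

def huber(r: float, delta: float = 1e-3) -> float:
    return 0.5 * r * r if abs(r) <= delta else delta * (abs(r) - 0.5 * delta)

def fit_power_law(points):
    """Brute-force search over (E, A, alpha); ranges are illustrative."""
    best = None
    for e100 in range(150, 190):               # E in [1.50, 1.90)
        e = e100 / 100
        for a100 in range(10, 50):             # alpha in [0.10, 0.50)
            alpha = a100 / 100
            for loga in range(10, 35):         # A in [10, 10^3.4)
                a = 10 ** (loga / 10)
                cost = sum(huber(math.log(l) - math.log(e + a * n**-alpha))
                           for n, l in points)
                if best is None or cost < best[0]:
                    best = (cost, e, a, alpha)
    return best[1:]

# Noise-free synthetic data from E=1.69, A=100, alpha=0.30.
data = [(n, 1.69 + 100 * n**-0.30) for n in (1e6, 1e7, 1e8, 1e9, 1e10)]
e, a, alpha = fit_power_law(data)
print(e, a, alpha)   # recovers 1.69, 100.0, 0.3 on this noise-free data
```

The Huber loss behaves quadratically for small residuals and linearly for large ones, so a few outlier runs do not drag the fitted exponent the way a least-squares objective would.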
4. Extensions: Architecture, Latency, and Sparsity
Modern scaling law frameworks explicitly incorporate architecture- and efficiency-aware factors:
- Latency and memory bandwidth: Empirical runtime is much better predicted by the volume of memory copy operations (dominant term in accelerator-bound environments) than by pure FLOP count. The closed-form throughput and loss prediction as a function of transformer hyperparameters allows analytic architecture optimization for fixed wall-clock time budgets, pointing to wide–shallow configurations as preferable for a given parameter count (Inbar et al., 2024).
- Inference cost: Shape-aware scaling laws co-optimize $N$, $D$, and the model aspect ratio (width relative to depth), penalizing loss as a function of model shape in addition to scale.
These forms enable Pareto-efficient, inference-optimized model selection and demonstrate that for a fixed $N$, wider, shallower models achieve substantially faster inference at identical task accuracy, confirming model-shape dependence in scaling (Bian et al., 30 Jan 2025).
- Conditional scaling over architecture: Parameterizing the loss exponents as functions of hidden size and the MLP-to-attention ratio yields models (e.g., "Panda" and "Surefire") that outperform baselines in both loss and inference throughput (Bian et al., 21 Oct 2025).
- Sparsity: The "average parameter count" over pre-training, $\bar{N}$, replaces the static $N$ in the Chinchilla law, unifying sparse and dense pre-training in a single scaling framework. Empirical results confirm that for matched $\bar{N}$ and $D$, "lossless" compression at substantial sparsity levels yields no training or downstream accuracy deficit, decoupling training-time compute from inference speed (Jin et al., 21 Jan 2025).
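The average-parameter substitution can be sketched as follows. The linear 0-to-50% sparsity ramp is invented for illustration, and reusing the dense Chinchilla constants is an assumption of this sketch, not the paper's fit:

```python
# Sketch of the "average parameter count" idea for sparse pre-training:
# plug the time-averaged N-bar into a Chinchilla-style law instead of
# the final (sparse) N. Schedule and constants are illustrative only.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def avg_params(n_dense: float, sparsity_schedule) -> float:
    """Arithmetic mean of the active parameter count over training steps."""
    active = [n_dense * (1.0 - s) for s in sparsity_schedule]
    return sum(active) / len(active)

def loss(n: float, d: float) -> float:
    return E + A / n**ALPHA + B / d**BETA

steps = 1000
schedule = [0.5 * t / (steps - 1) for t in range(steps)]  # ramp 0% -> 50%
n_bar = avg_params(1e9, schedule)
print(f"average N = {n_bar:.3e}")   # 7.5e8 for a linear 0 -> 50% ramp
print(f"predicted loss = {loss(n_bar, 2e10):.3f}")
```

The payoff is that the final model is 50% sparse (cheap at inference), while the training-loss prediction depends only on the larger time-averaged $\bar{N}$.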
5. Origin and Robustness of Scaling Laws
Recent theoretical and synthetic experiments reinforce the universality and origin of scaling laws:
- Scaling exponents' values are not reducible solely to data distribution tails; robust power-law scaling is observed in synthetic settings devoid of Zipf structure (e.g., transformers on random walks), implying an architectural emergence of power-law spectra or optimization effects (Barkeshli et al., 15 Jan 2026).
- Data complexity, as modulated via the dataset generative process or architecture parameterization, alters the data exponent $\beta$, while the parameter exponent $\alpha$ typically remains stable (around $0.5$–$0.76$ in 2-layer transformers and task-agnostic settings).
- Accounting for an irreducible loss offset (the $E$ term) is emphasized as necessary for fit validity. One-dimensional power-law fits that include such offsets outperform both fixed functional forms and exponentials in predictive accuracy.
6. Empirical Design Implications and Current Best Practices
Compiled results across multiple studies yield convergent recipes for LLM design under budget:
- Use total (including embeddings) parameter and compute counts in all scaling law reporting (Pearce et al., 2024, Porian et al., 2024).
- For compute-optimal training, allocate compute such that $N_{\mathrm{opt}}, D_{\mathrm{opt}} \propto C^{0.5}$ for sufficiently large $C$, ensuring the "Chinchilla frontier" is approached (Pearce et al., 2024).
- For practical deployment, augment Chinchilla or related scaling laws by incorporating shape (aspect ratio) or conditional architectural parameters to optimize for latency-constrained inference, with wider, shallower models yielding substantial efficiency gains for a fixed parameter count (Inbar et al., 2024, Bian et al., 30 Jan 2025, Bian et al., 21 Oct 2025).
- In sparse pre-training, the effective parameter count controlling train loss is the arithmetic mean $\bar{N}$ of the active parameter count over training steps; use the average-parameter version of the scaling law for both dense and sparse regimes (Jin et al., 21 Jan 2025).
- For high-complexity or high-dimensional input, theories predict that more of the budget should be channeled into increasing $D$ rather than $N$; this is supported by both analytic upper-bound derivations and empirical sweeps (Jeon et al., 2022).
- Nonparametric frontier fitting and sensitivity analysis over the fitted exponents $(\alpha, \beta)$ allow for more robust extrapolation at new scales and in atypical regimes (Barkeshli et al., 15 Jan 2026).
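A small sensitivity check over the fitted exponents illustrates the last point. It uses the closed-form compute-optimal $N$ implied by the Chinchilla parametric loss, obtained by minimizing $A N^{-\alpha} + B D^{-\beta}$ subject to $C = 6ND$; the perturbation values below are arbitrary:

```python
# Sensitivity of the compute-optimal N to the fitted exponents in
# L = E + A/N^alpha + B/D^beta with C ~= 6*N*D. Closed form:
#   N_opt = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta))
# A, B are the published Chinchilla prefactors; perturbations are arbitrary.
A, B = 406.4, 410.7

def n_opt(compute: float, alpha: float, beta: float) -> float:
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return g * (compute / 6.0) ** (beta / (alpha + beta))

C = 1e24
nominal = n_opt(C, 0.34, 0.28)
for alpha, beta in [(0.30, 0.28), (0.34, 0.32), (0.38, 0.24)]:
    shifted = n_opt(C, alpha, beta)
    print(f"alpha={alpha}, beta={beta}: N_opt shifts by x{shifted / nominal:.2f}")
```

Because the compute exponent is $\beta/(\alpha+\beta)$, even modest errors in the fitted exponents compound multiplicatively when extrapolating the optimal model size to new budgets, which is why exponent-level sensitivity analysis matters.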
The combined legacy of Kaplan and Chinchilla scaling laws is a rigorously grounded, empirically validated, and now architecture-aware methodology for predicting LLM scaling and guiding efficient allocation of resources in pretraining and deployment. Extensions to account for architectural details, memory, and inference cost have established that contemporary model selection and scaling optimization can be performed entirely with closed-form, hyperparameter-only loss models (Inbar et al., 2024, Bian et al., 30 Jan 2025, Bian et al., 21 Oct 2025).