CPT Scaling Law for Continual Pre-Training

Updated 21 January 2026

CPT Scaling Law is a framework defining power-law relationships among model size, training tokens, and domain mixture to predict loss and optimize trade-offs.
It integrates laws like D-CPT, CMR, CPT Dynamics, Perplexity-aware, and PTPP-aware to guide data selection, mixture ratios, and training schedules.
Empirical validations across diverse LLM architectures demonstrate its ability to reduce compute costs while enhancing domain-specific performance and mitigating catastrophic forgetting.

CPT Scaling Law

The CPT Scaling Law refers to quantitative scaling relationships that govern the performance, efficiency, and hyperparameter optimization of Continual Pre-Training (CPT) in LLMs. In CPT, a pre-trained model is further trained (“pre-trained”) on additional general and/or domain-specific corpora to adapt its capabilities while balancing the retention of general knowledge with acquisition of new, specialized knowledge. The CPT Scaling Law framework encompasses several mathematical formulations that predict the impact of domain/general mixture ratio, compute spent, model size, pre-training budget, and data informativeness on downstream and general-domain loss surfaces, enabling principled control of catastrophic forgetting, adaptation efficiency, and hyperparameter tuning.

1. Mathematical Formulations of CPT Scaling Laws

Recent work has established several factual scaling law formulations for CPT, each with distinct functional forms and empirical motivations. A central principle in all is the existence of power-law or rational-law relationships among the core variables: model size ( $N$ ), continual-pretraining corpus size ( $D$ ), domain-data ratio ( $r_d$ or $m$ ), and sometimes additional factors such as data perplexity or tokens-per-parameter ( $T$ ). Below is an overview of representative laws:

D-CPT Law

The D-CPT Law models the validation loss $L$ on either domain or general corpus as follows (Que et al., 2024):

$L(N, D, r) = E + \frac{A}{N^{\alpha}} + \frac{B r^{\eta}}{D^{\beta}} + \frac{C}{(r + \epsilon)^{\gamma}}$

$N$ : model parameters (B)
$D$ : training tokens (B)
$r$ : mixture fraction (domain or general)
$E, A, B, C > 0$ : scale parameters; $\alpha, \beta, \gamma, \eta > 0$ ; $\epsilon > 0$ (small offset)

Through fitting and validation, the D-CPT Law enables prediction of the loss given arbitrary mixture, data scale, and model size, and supports optimization for trade-offs between generality and specialization.

CMR Scaling Law

The CMR Scaling Law formalizes the optimal (critical) mixture ratio $m^*$ for domain-specific data at given model size and data budget by power-law fits (Gu et al., 2024):

$m^*(T) = \alpha_4 T^{s_4} + \beta_3$

where $T$ is the total tokens, and $\alpha_4, s_4, \beta_3$ are fit per model.

This law derives $m^*$ by balancing domain loss minimization and constraining general loss increases under explicit Lagrangian formalism, thus specifying optimal resource allocation for target performance constraints.

CPT Dynamics Law

A dynamical scaling law describes general and domain loss trajectories over CPT training steps $t$ as a function of learning-rate schedule ( $\eta_t$ ) and replay mixture ratio, decoupling annealing and distribution-shift effects (Wang et al., 12 May 2025):

$L(t) = L_0 + A (S_{\mathrm{PT}} + S_{\mathrm{CPT}}(t))^{-\alpha} - C_1 A_{\mathrm{PT}} - C_2 A_{\mathrm{CPT}}(t) + B [1 - (1 + E S_{\mathrm{CPT}}(t))^{-\beta}]$

$S_{\mathrm{PT}}, S_{\mathrm{CPT}}$ : learning-rate "areas"
$A_{\mathrm{CPT}}$ , $A_{\mathrm{PT}}$ : annealing "areas"
$C_2$ and $B$ scale with replay ratio exponentially
$L_0$ , $A$ , $B$ , $C_1$ , $C_2$ , $\alpha$ , $\beta$ , $E$ are fit from data

This formulation allows prediction of loss for arbitrary CPT schedules, replay proportions, and training durations.

Perplexity-Aware Data Scaling Law

This law predicts CPT performance as a function of the knowledge gap in domain data, quantified by sample perplexity statistics ( $\mu$ , $\sigma$ ) under the base model (Liu et al., 25 Dec 2025). The test loss scales as:

$\hat L(\mu, \sigma, D) = E + \frac{D_c}{Q(\mu,\sigma) D^{\alpha_D}}$

where $Q(\mu,\sigma) = \mu^{\alpha_\mu(\sigma)} \sigma^{\alpha_\sigma(\mu)}$ and exponents $\alpha_\mu, \alpha_\sigma$ are themselves linear in $\sigma$ , $\mu$ .

This formulation provides a basis for data selection algorithms that maximize adaptation utility per token budget.

PTPP-Aware Adaptation Scaling Law

A recent generalization makes the pre-training budget (tokens-per-parameter, $T$ ) an explicit argument in the adaptation law, supporting accurate out-of-distribution forecasting for target loss at unseen pre-training budgets (Goffinet et al., 27 Oct 2025):

$\hat L(N, D, r, T) = E + \frac{A}{N^\alpha + B r^{\nu} D^{\beta_{\mathrm{eff}}(T)} + \frac{C}{(r+\varepsilon_r)^{\gamma} + \frac{F}{T^{\eta}}}}$

with $\beta_{\mathrm{eff}}(T) = \beta (1 - \lambda T^\zeta / (1 + T^\zeta))$ .

This law enables compute/replay constrained planning and accurate loss prediction when varying $T$ and $D$ jointly.

2. Empirical Validation, Fitting, and Domains

CPT scaling laws are established through systematic empirical campaigns across model architectures, domains, and data budgets:

D-CPT Law was validated on seven Qwen-1.5 transformer variants across six domains (math, code, chemistry, etc.), using hundreds of runs over domain-general mixture fractions $r \in [0,1]$ , $N \in \{0.5, 1.8, 4, 7\}$ B, and $D$ spanning $0.1$B to $26$B tokens (Que et al., 2024).
The CMR Scaling Law used four autoregressive Transformers (460M–3.1B params) and token budgets up to 20B, showing that the predicted optimal mixture fraction matches the empirical maximum feasible ratio within $<1\%$ absolute error (Gu et al., 2024).
The CPT Dynamics Law demonstrated predictive accuracy ( $R^2 > 0.99$ ) across learning-rate schedules, replay ratios, and model sizes on both general and domain validation sets (Wang et al., 12 May 2025).
Perplexity-aware scaling was validated for both medical and general benchmarks (e.g., Qwen3-14B-Base primed via 8K CPT steps), showing that data subsets selected via the scaling law achieve superior validation and task performance compared to random or naive perplexity-based selection (Liu et al., 25 Dec 2025).
PTPP-aware scaling laws accurately extrapolated adaptation loss at $T=279$ when fitted only on $T \in \{15,31\}$ , outperforming PTPP-agnostic transfer baselines by an order of magnitude in prediction error (Goffinet et al., 27 Oct 2025).

All laws are estimated by robust regression (e.g., Huber loss in log-space), with theoretical and practical constraints on exponents to maintain monotonicity and physical plausibility.

3. Optimization of Mixture Ratios, Data Scale, and Hyperparameters

CPT scaling laws enable the derivation of principled recipes for mixture ratio (domain vs. general), data size, and other hyperparameters:

Mixture Ratio ( $r_d$ or $m$ ): Both D-CPT and CMR scaling laws demonstrate that an optimal mixture ratio exists—often high (0.7–0.95) for medium-sized LLMs—balancing domain performance with general retention. The CMR law provides a simple power-law formula for $m^*$ as a function of data budget $T$ and model size $S$ , eliminating exhaustive grid search (Gu et al., 2024, Que et al., 2024).
Data Scale and Domain-Learnability: The D-CPT Cross-Domain law introduces a Domain-Learnability Coefficient $K$ that, once calibrated with a handful of pilot runs, predicts loss for new domains with high accuracy using only 1% additional compute.
Learning-Rate Schedule and Replay Ratio: The CPT Dynamics Law decouples the effect of learning-rate annealing and domain shift, exposing a direct tradeoff: higher replay (more general data) mitigates forgetting but slows domain adaptation and thus can be jointly optimized for a desired general-vs-domain loss tradeoff.
Compute-Aware Planning: PTPP-aware scaling makes tokens-per-parameter a variable, supporting precise adaptation under finite compute or memory. This is critical for scaling training in real-world resource contexts (Goffinet et al., 27 Oct 2025).
Data Subset Selection: Perplexity-aware law provides a framework for selecting maximally informative data under a fixed token budget by matching the predicted optimal perplexity statistics $(\hat\mu, \hat\sigma)$ of the training subset (Liu et al., 25 Dec 2025).

4. Theoretical Interpretation and Connections

CPT scaling laws translate observed neural scaling phenomena into a rigorous analytical framework. Their derivation is empirical, but several theoretical connections hold:

Power-law origins: The prevalence of power-law scaling (in $r, D, N, T$ ) is consistent with classical non-parametric learning theory and approximation-theoretic analysis, as established in the broader deep learning scaling literature (e.g., (Rosenfeld, 2021)).
Trade-off surfaces: CPT laws formalize trade-offs between catastrophic forgetting (degradation in general-domain loss) and domain adaptation benefit, operationalizing these as explicit Lagrangian or Pareto frontiers.
Decoupling of factors: CPT laws empirically decouple learning-rate annealing from distribution shift, mirroring hierarchical error decomposition in classical learning theory.
Domain difficulty: Cross-Domain D-CPT Law's Domain-Learnability Coefficient $K$ encapsulates domain hardness, akin to capacity or margin parameters in statistical learning theory.

A plausible implication is that future work could yield a unified multitask or multi-domain CPT scaling law combining these axes, as conjectured in recent extensions (Goffinet et al., 27 Oct 2025).

5. Practical Considerations, Limitations, and Extensions

Major practical findings and caveats from the literature include:

Efficiency gains: Scaling-law–driven CPT design reduces compute by orders of magnitude relative to brute-force grid search, including rapid law fitting, cross-domain extrapolation, and mix-ratio prediction with minimal pilot runs (Que et al., 2024, Gu et al., 2024).
Domain transferability: While mixture law parameters generalize well across several domains and tasks, strong out-of-distribution performance is subject to domain similarity and learnability as measured empirically.
Model/Corpus dependence: Most laws are fitted for a particular LLM architecture family (e.g., Qwen, ArXiv models); transfer across architectures or diverse corpora is likely but not strictly established and may alter exponents or constants.
Law breakdowns: Below empirical mixture “inflection’’ points (e.g., $r_i \lesssim 0.05$ ), monotonicity and power-law behavior can falter due to inadequacy of domain data or domain coverage.
Bayesian/uncertainty extensions: Uncertainty quantification via Bayesian or ensembling could enhance error bars for law-based predictions (Goffinet et al., 27 Oct 2025).
Metric limitations: Most results pertain to (cross-entropy) validation loss; extension to task-specific metrics (BLEU, F1, downstream scores) is an open direction.

6. Comparative Summary of Key CPT Scaling Law Papers

Law & Reference	Main Variables	Predictive Use Cases
D-CPT Law (Que et al., 2024)	$N$ , $D$ , $r$	Mixture/tradeoff prediction, law fitting, new domain
CMR Law (Gu et al., 2024)	$S$ , $T$ , $m$	Optimal domain ratio, loss-constrained balancing
CPT Dynamics (Wang et al., 12 May 2025)	$t$ , $\eta$ (LR), $r$	Loss forecasting under arbitrary CPT schedules
Perplexity-Aware (Liu et al., 25 Dec 2025)	$\mu$ , $\sigma$ , $D$	Adaptive data selection by informativeness
PTPP-aware (Goffinet et al., 27 Oct 2025)	$N$ , $D$ , $r$ , $T$	Extrapolation to unseen pretraining budget, replay

The convergence of these CPT scaling laws now enables loss surface prediction, compute-resource optimization, and hyperparameter tuning for continual pre-training of LLMs in both research and production settings. As domain boundaries blur and models scale further, these laws provide robust, predictive tools for principled CPT design.