
Perplexity-Aware Data Scaling Law

Updated 1 January 2026
  • The paper introduces a perplexity-aware scaling law that quantifies data informativeness using the mean and variance of perplexity.
  • It employs a modified mathematical framework to adapt classical scaling laws for continual pre-training, optimizing data subset selection with a Distance-to-Optimum Selection (DOS) algorithm.
  • Empirical results demonstrate consistent improvements in model performance and faster convergence on both medical and general-domain benchmarks.

The Perplexity-Aware Data Scaling Law provides a predictive framework for modeling test loss in continual pre-training (CPT) processes by quantifying data informativeness through statistical properties of perplexity. Unlike classical scaling laws, which express generalization loss purely as a function of dataset size and model capacity, the perplexity-aware extension introduces the mean and variance of domain-specific perplexity as key parameters, thereby enabling more efficient data subset selection and adaptive sampling for foundation model adaptation (Liu et al., 25 Dec 2025).

1. Mathematical Formulation

The classical data scaling law for LLMs, with fixed model parameter count $N$ and token count $D$, takes the form

$$\hat L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where $E$ is the irreducible entropy, $A$ and $B$ are fitted constants, and $\alpha$, $\beta$ denote scaling exponents. The perplexity-aware extension modifies the loss to

$$\hat L(\mu, \sigma, D) = E + \frac{D_c}{\left(\mu^{\alpha_\mu(\sigma)} \, \sigma^{\alpha_\sigma(\mu)}\right) D^{\alpha_D}}$$

with $\mu$ and $\sigma^2$ the mean and variance of the perplexity landscape, $D_c$ a fitted constant, and interaction exponents $\alpha_\mu(\sigma) = \alpha_0 + \alpha_1 \sigma$ and $\alpha_\sigma(\mu) = \beta_0 + \beta_1 \mu$ capturing the interplay of $\mu$ and $\sigma$. Expanding the exponents, the denominator factors as

$$\mu^{\alpha_0} \, \sigma^{\beta_0} \times \mu^{\alpha_1 \sigma} \, \sigma^{\beta_1 \mu} \times D^{\alpha_D},$$

encoding both independent and interaction effects. A plausible implication is that loss convergence is tied to the quality and diversity of domain data rather than to quantity alone.
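
The modified law is straightforward to evaluate once its constants are known. A minimal Python sketch, with every constant an illustrative placeholder rather than a value fitted in the paper:

```python
def predicted_loss(mu, sigma, D, *, E, D_c, a0, a1, b0, b1, alpha_D):
    """Perplexity-aware scaling law L^(mu, sigma, D).

    Interaction exponents: alpha_mu(sigma) = a0 + a1*sigma,
                           alpha_sigma(mu) = b0 + b1*mu.
    All constants here are placeholders, not values fitted in the paper.
    """
    alpha_mu = a0 + a1 * sigma
    alpha_sigma = b0 + b1 * mu
    return E + D_c / ((mu ** alpha_mu) * (sigma ** alpha_sigma) * (D ** alpha_D))
```

With positive exponents, the predicted loss decays toward the irreducible entropy $E$ as the token count $D$, the mean perplexity, or the perplexity variance grows.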

2. Perplexity Landscape: Definition and Computation

Perplexity for a sequence $x$ evaluated by a frozen base model is

$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right)$$

with $T$ the sequence length. The perplexity landscape arises from a forward pass over all corpus chunks $\{c_j\}$, yielding the distribution statistics

$$\mu = \mathbb{E}[\mathrm{PPL}(c_j)], \qquad \sigma^2 = \mathrm{Var}[\mathrm{PPL}(c_j)].$$

Low-perplexity regions indicate known or easily modeled content; extremely high-perplexity regions suggest noise or out-of-domain examples. Empirically, the optimal subset for adaptation maximizes coverage of moderate perplexity—this “sweet spot” optimizes information gain (Liu et al., 25 Dec 2025).
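
The landscape statistics reduce to a few lines once per-token log-probabilities are available from the frozen base model. A minimal sketch (the list-of-log-probs interface is an assumption for illustration; any causal LM forward pass can supply these values):

```python
import math
import statistics

def chunk_perplexity(token_logprobs):
    """PPL(x) = exp(-(1/T) * sum_t log p(x_t | x_<t)), given the per-token
    log-probabilities from one forward pass of the frozen base model."""
    T = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / T)

def perplexity_landscape(chunk_logprobs):
    """Mean mu and variance sigma^2 of PPL over corpus chunks {c_j}."""
    ppls = [chunk_perplexity(lp) for lp in chunk_logprobs]
    mu = statistics.fmean(ppls)
    var = statistics.pvariance(ppls, mu)
    return mu, var
```

For example, a chunk whose four tokens each receive probability 0.25 has perplexity exactly 4.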

3. Parameter Fitting and Loss Prediction

Fitting the law involves bootstrapping multiple training subsets $\{S_i\}$ from the corpus, each with different $(\mu_i, \sigma_i^2)$, then continually pre-training the base model and recording post-adaptation test losses $\{L_i\}$:

```
Input: full domain corpus C, token budget D, base model M
1. Sample K bootstrap subsets {S_i} from C, each of size D, covering diverse (μ_i, σ_i²).
2. For each S_i: continually pre-train M on S_i and evaluate test loss L_i.
3. Split {(μ_i, σ_i, D, L_i)} into train/validation sets.
4. Solve min_θ Σ_train (L_i − L̂(μ_i, σ_i, D; θ))².
5. Validate; tune if necessary.
Output: fitted θ.
```

Published experiments used PubMed data with Qwen3 models at the 0.6B and 14B scales, confirming the law's generality and convergence across medical and general-domain benchmarks (Liu et al., 25 Dec 2025).
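
Step 4 above is a least-squares fit over θ = (E, D_c, α_0, α_1, β_0, β_1, α_D). A self-contained sketch, using a crude random search as a stand-in for a proper optimizer (in practice one would use, e.g., scipy.optimize.curve_fit or L-BFGS); the sampling ranges are arbitrary assumptions:

```python
import random

def sse(theta, samples):
    """Sum over bootstrap runs of (L_i - Lhat(mu_i, sigma_i, D; theta))^2.
    theta = (E, D_c, a0, a1, b0, b1, aD); samples = [(mu, sigma, D, L), ...]."""
    E, D_c, a0, a1, b0, b1, aD = theta
    total = 0.0
    for mu, sigma, D, L in samples:
        pred = E + D_c / (mu ** (a0 + a1 * sigma)
                          * sigma ** (b0 + b1 * mu)
                          * D ** aD)
        total += (pred - L) ** 2
    return total

def fit_random_search(samples, n_trials=5000, seed=0):
    """Crude stand-in for the fitting step: random search over theta."""
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_trials):
        theta = (rng.uniform(0.0, 3.0), rng.uniform(1.0, 200.0),
                 rng.uniform(-1.0, 1.0), rng.uniform(-0.1, 0.1),
                 rng.uniform(-1.0, 1.0), rng.uniform(-0.1, 0.1),
                 rng.uniform(0.0, 1.0))
        loss = sse(theta, samples)
        if loss < best_loss:
            best, best_loss = theta, loss
    return best, best_loss
```

The validation split in step 5 would score the fitted θ via the same `sse` on held-out runs.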

4. Adaptive Data Subset Selection Algorithm

Given the fitted law, the selection objective $J(\mathcal S)$ for a subset $\mathcal S$ under token budget $T_\text{budget}$ is

$$J(\mathcal S) = w_\mu \left(\mu(\mathcal S) - \hat\mu\right)^2 + w_\sigma \left(\sigma^2(\mathcal S) - \hat\sigma^2\right)^2$$

subject to $\sum_{c_j \in \mathcal S} |c_j| \leq T_\text{budget}$. Since exact minimization is computationally infeasible, a greedy algorithm sequentially adds chunks closest to the optimum $(\hat\mu, \hat\sigma^2)$ until the budget is exhausted. This approach, termed Distance-to-Optimum Selection (DOS), prioritizes data for maximum utility and diversity, aligning with the loss-driven "sweet spot" in the perplexity landscape (Liu et al., 25 Dec 2025).
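
A minimal greedy sketch of this selection, under stated assumptions: chunks are represented as (token_count, PPL) pairs, and "closest to optimal" is read as adding, at each step, the chunk whose inclusion most reduces $J(\mathcal S)$; both choices are illustrative readings, not taken verbatim from the paper:

```python
def dos_select(chunks, mu_hat, var_hat, budget, w_mu=1.0, w_var=1.0):
    """Greedy Distance-to-Optimum Selection (DOS) sketch.
    chunks: list of (token_count, ppl) pairs."""
    def J(ppls):
        # J(S) = w_mu*(mu(S) - mu_hat)^2 + w_var*(var(S) - var_hat)^2
        n = len(ppls)
        mu = sum(ppls) / n
        var = sum((p - mu) ** 2 for p in ppls) / n
        return w_mu * (mu - mu_hat) ** 2 + w_var * (var - var_hat) ** 2

    selected, remaining, used = [], list(chunks), 0
    while remaining:
        feasible = [c for c in remaining if used + c[0] <= budget]
        if not feasible:
            break  # token budget exhausted
        # Add the chunk whose inclusion brings (mu, sigma^2) closest to optimum.
        best = min(feasible, key=lambda c: J([p for _, p in selected] + [c[1]]))
        selected.append(best)
        remaining.remove(best)
        used += best[0]
    return selected
```

With a target of $\hat\mu = 5$ and $\hat\sigma^2 = 0$, the selector preferentially exhausts its budget on intermediate-perplexity chunks, matching the sweet-spot behavior described above.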

5. Empirical Findings and Experimental Results

A controlled comparison of CPT strategies on Qwen3-14B-Base demonstrates the practical efficacy of DOS against random sampling (RS) and low-/high-perplexity sampling (LPS/HPS) baselines. Key results (accuracy, %):

| Task | Base | RS | LPS | HPS | DOS |
|---|---|---|---|---|---|
| DiagnosisArena | 41.3 | 56.4 | 56.4 | 45.7 | 61.11 |
| GPQA-Med | 57.89 | 57.89 | 57.89 | 57.89 | 63.16 |
| Medical Avg. | 69.32 | 71.34 | 71.22 | 70.56 | 72.48 |
| General Avg. | 84.16 | 83.94 | 83.99 | 83.94 | 84.16 |

DOS yields a consistent +3.16 percentage point improvement on medical tasks relative to baseline and converges faster to lower test-loss values, as confirmed by both loss curves and t-SNE visualizations capturing selection diversity and intermediate-PPL focus (Liu et al., 25 Dec 2025). A plausible implication is that the DOS strategy is robust across data types and model scales.

6. Extensions, Context, and Generalization

The perplexity-aware scaling law extends power-law scaling frameworks described in models with random-feature maps and spectral power laws (Maloney et al., 2022), with the additional consideration of data quality and informativeness in the large-data, large-model regime. Once the hyperparameters $(\alpha_0, \alpha_1, \beta_0, \beta_1, \alpha_D)$ are fitted for a given domain, the law supports efficient prediction of optimal subset statistics and rapid sampler deployment for new domains. This suggests broader utility in continual domain adaptation and principled data curation, with the law remaining stable across distinct application areas and model scales. The requirement of only a single forward pass for PPL estimation enhances scalability for practical deployment (Liu et al., 25 Dec 2025).

7. Relationship to Classical Scaling Laws and Theoretical Models

Classical scaling laws, modeled and analyzed in (Maloney et al., 2022), describe loss as a sum of power-law terms in dataset size and parameter count, with finite phase transitions marking bottleneck regimes. The incorporation of perplexity statistics in (Liu et al., 25 Dec 2025) operationalizes the spectral quality of the data distribution and enables finer-grained loss prediction. Theoretical implications include the alignment of the empirical sweet spot with latent cutoff-induced plateaus in loss scaling. This law may thereby provide a pathway for synthesizing empirical and theoretical analyses of data/model tradeoffs in large-scale foundation model adaptation.
