Perplexity-Aware Data Scaling Law
- The paper introduces a perplexity-aware scaling law that quantifies data informativeness using the mean and variance of perplexity.
- It employs a modified mathematical framework to adapt classical scaling laws for continual pre-training, optimizing data subset selection with a Distance-to-Optimum Selection (DOS) algorithm.
- Empirical results demonstrate consistent improvements in model performance and faster convergence on both medical and general-domain benchmarks.
The Perplexity-Aware Data Scaling Law provides a predictive framework for modeling test loss in continual pre-training (CPT) processes by quantifying data informativeness through statistical properties of perplexity. Unlike classical scaling laws, which express generalization loss purely as a function of dataset size and model capacity, the perplexity-aware extension introduces the mean and variance of domain-specific perplexity as key parameters, thereby enabling more efficient data subset selection and adaptive sampling for foundation model adaptation (Liu et al., 25 Dec 2025).
1. Mathematical Formulation
The classical data scaling law for LLMs, given fixed model parameters $N$ and token count $D$, takes the form

$$L(D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $E$ is the irreducible entropy, $A$ and $B$ are fit constants, and $\alpha$, $\beta$ denote scaling exponents. The perplexity-aware extension modifies the loss to

$$\hat{L}(\mu, \sigma^2, D) = E + \frac{B}{D^{\beta}} + C \cdot \mu^{\gamma} (\sigma^2)^{\delta},$$

with $\mu$ and $\sigma^2$ as the mean and variance of the perplexity landscape, $C$ a fitted constant, and interaction exponents $\gamma$, $\delta$ specific to the interplay of $\mu$ and $\sigma^2$. This form encodes both independence and interaction effects; a plausible implication is that loss convergence is tied to the quality and diversity of domain data rather than quantity alone.
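As a concrete illustration, a fitted law of this shape can be evaluated as a plain function. This is a minimal sketch: the functional form follows the description above, and every parameter value below is an assumed placeholder rather than a value fitted in the paper.

```python
# Sketch of a perplexity-aware loss law. All constants are assumed
# placeholder values for illustration, not fitted values from the paper.
E, B, beta = 1.20, 8.0, 0.35        # irreducible entropy, data-scaling constant/exponent
C, gamma, delta = 0.50, -0.6, -0.3  # interaction constant and exponents (assumed)

def predicted_loss(mu: float, sigma2: float, D: float) -> float:
    """Predicted post-CPT test loss for perplexity stats (mu, sigma2) and budget D."""
    return E + B * D ** (-beta) + C * mu ** gamma * sigma2 ** delta

# Holding the perplexity statistics fixed, a larger token budget D
# should yield a lower predicted loss.
print(predicted_loss(20.0, 50.0, 1e9))
print(predicted_loss(20.0, 50.0, 1e10))
```

With this form, the data-size term decays as a power law in $D$ while the perplexity term shifts the achievable loss according to the landscape statistics, matching the intuition that quality and diversity matter independently of quantity.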
2. Perplexity Landscape: Definition and Computation
Perplexity for a sequence $x = (x_1, \ldots, x_T)$ evaluated by a frozen base model $p_\theta$ is

$$\mathrm{PPL}(x) = \exp\!\left(-\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\right),$$

with $T$ as the sequence length. The perplexity landscape arises from a forward pass over all corpus chunks $\{c_j\}_{j=1}^{N}$, yielding the distribution

$$\{\mathrm{PPL}(c_j)\}_{j=1}^{N}, \qquad \mu = \mathbb{E}\big[\mathrm{PPL}(c_j)\big], \qquad \sigma^2 = \mathrm{Var}\big[\mathrm{PPL}(c_j)\big].$$
Low-perplexity regions indicate known or easily modeled content; extremely high-perplexity regions suggest noise or out-of-domain examples. Empirically, the optimal subset for adaptation maximizes coverage of moderate perplexity—this “sweet spot” optimizes information gain (Liu et al., 25 Dec 2025).
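The landscape computation can be sketched end to end. In this sketch the frozen model's forward pass is replaced by synthetic per-token log-probabilities, so the chunk count and sizes are stand-ins; only the PPL formula and the landscape statistics mirror the definitions above.

```python
import math
import random

random.seed(0)

def perplexity(token_logprobs):
    """PPL(x) = exp(-(1/T) * sum_t log p(x_t | x_<t)) for one chunk."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Stand-in for a frozen base model's forward pass: draw synthetic per-token
# log-probabilities for 1000 corpus chunks of 128 tokens each.
corpus_logprobs = [[math.log(random.uniform(0.05, 0.9)) for _ in range(128)]
                   for _ in range(1000)]
ppls = [perplexity(lp) for lp in corpus_logprobs]

# Landscape statistics (mu, sigma^2) consumed by the scaling law.
mu = sum(ppls) / len(ppls)
sigma2 = sum((p - mu) ** 2 for p in ppls) / len(ppls)
print(f"mean PPL = {mu:.2f}, variance = {sigma2:.2f}")
```

In practice the inner lists would come from a single forward pass of the frozen base model over the corpus, which is what keeps the landscape computation cheap.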
3. Parameter Fitting and Loss Prediction
Fitting the law involves bootstrapping multiple training subsets from the corpus, each with different $(\mu_i, \sigma_i^2)$, then continually pre-training the base model and recording post-adaptation test losses $L_i$:
```
Input: Full domain corpus C, budget D, base model M
1. Sample K bootstrap subsets {S_i} from C, each of size D, covering diverse (μ_i, σ_i^2).
2. For each S_i: continually pre-train M and evaluate test loss L_i.
3. Split {(μ_i, σ_i^2, D, L_i)} into train/validation sets.
4. Solve min_θ Σ_train (L_i − L̂(μ_i, σ_i^2, D; θ))^2.
5. Validate; tune if necessary.
Output: fitted θ.
```
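The least-squares step of this procedure is an ordinary nonlinear curve fit. The sketch below generates synthetic bootstrap records from a known parameterization and recovers a fit; the loss form and all numeric values are assumptions for illustration, and `scipy.optimize.curve_fit` stands in for whatever solver the authors use.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def loss_model(X, E, B, beta, C, gamma, delta):
    """Assumed perplexity-aware loss form (a sketch, not the paper's exact one)."""
    mu, sigma2, D = X
    return E + B * D ** (-beta) + C * mu ** gamma * sigma2 ** delta

# Synthetic bootstrap records (mu_i, sigma_i^2, D, L_i): losses generated from
# known "true" parameters plus small noise, standing in for real CPT runs.
true = (1.2, 8.0, 0.35, 0.5, -0.6, -0.3)
mu = rng.uniform(5.0, 40.0, 40)
sigma2 = rng.uniform(10.0, 200.0, 40)
D = rng.uniform(1e8, 1e9, 40)
L = loss_model((mu, sigma2, D), *true) + rng.normal(0.0, 1e-3, 40)

# Step 4 of the procedure: least-squares fit of theta on the records.
theta, _ = curve_fit(loss_model, (mu, sigma2, D), L,
                     p0=[1.0, 5.0, 0.3, 0.3, -0.5, -0.2], maxfev=20000)
pred = loss_model((mu, sigma2, D), *theta)
print("RMSE:", float(np.sqrt(np.mean((pred - L) ** 2))))
```

Holding out a validation split of the records, as in step 3, guards against the fitted exponents overfitting the particular bootstrap subsets sampled.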
4. Adaptive Data Subset Selection Algorithm
Given the fitted law, the selection objective for a subset $S \subseteq C$ under token budget $D$ is

$$S^{*} = \arg\min_{S} \; \hat{L}\big(\mu(S), \sigma^2(S), D\big),$$

subject to $\sum_{c \in S} |c| \le D$. Since exact minimization is computationally infeasible, a greedy algorithm sequentially adds chunks closest to optimal until the budget is exhausted. This approach—termed Distance-to-Optimum Selection (DOS)—prioritizes data for maximum utility and diversity, aligning with the loss-driven "sweet spot" in the perplexity landscape (Liu et al., 25 Dec 2025).
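A minimal DOS-style greedy loop can be sketched as follows, assuming "distance to optimum" means the squared distance of the running subset statistics (mu, sigma^2) to target values implied by a fitted law; the targets, chunk sizes, and chunk perplexities here are all synthetic placeholders.

```python
import random

random.seed(1)

# Optimal perplexity statistics implied by a fitted law (assumed values).
mu_star, sigma2_star = 18.0, 120.0

# Corpus chunks as (token_count, ppl) pairs, e.g. from one frozen-model pass.
chunks = [(random.randint(200, 600), random.uniform(2.0, 80.0)) for _ in range(300)]
budget = 15_000  # token budget D

def stats(ppls):
    m = sum(ppls) / len(ppls)
    s2 = sum((p - m) ** 2 for p in ppls) / len(ppls)
    return m, s2

def distance(ppls):
    m, s2 = stats(ppls)
    return (m - mu_star) ** 2 + (s2 - sigma2_star) ** 2

selected, remaining, used = [], list(range(len(chunks))), 0
while remaining:
    # Greedily add the chunk that moves (mu(S), sigma^2(S)) closest to the optimum.
    best = min(remaining, key=lambda i: distance(
        [chunks[j][1] for j in selected] + [chunks[i][1]]))
    ntok = chunks[best][0]
    if used + ntok > budget:   # stop once the next-best chunk no longer fits
        break
    selected.append(best)
    used += ntok
    remaining.remove(best)

mu_sel, s2_sel = stats([chunks[i][1] for i in selected])
print(len(selected), used, round(mu_sel, 1), round(s2_sel, 1))
```

The greedy step here re-scores every remaining chunk against the running subset, which is quadratic in corpus size; a production sampler would presumably bucket chunks by PPL first, but the loop captures the distance-to-optimum idea.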
5. Empirical Findings and Experimental Results
A controlled comparison of CPT strategies on Qwen3-14B-Base demonstrates the practical efficacy of DOS. Key results (accuracy, %) include:
| Task | Base | RS | LPS | HPS | DOS |
|---|---|---|---|---|---|
| DiagnosisArena | 41.3 | 56.4 | 56.4 | 45.7 | 61.11 |
| GPQA-Med | 57.89 | 57.89 | 57.89 | 57.89 | 63.16 |
| Medical Avg. | 69.32 | 71.34 | 71.22 | 70.56 | 72.48 |
| General Avg. | 84.16 | 83.94 | 83.99 | 83.94 | 84.16 |
DOS yields a consistent +3.16 percentage point improvement on medical tasks relative to baseline and converges faster to lower test-loss values, as confirmed by both loss curves and t-SNE visualizations capturing selection diversity and intermediate-PPL focus (Liu et al., 25 Dec 2025). A plausible implication is that the DOS strategy is robust across data types and model scales.
6. Extensions, Context, and Generalization
The perplexity-aware scaling law extends power-law scaling frameworks described in models with random-feature maps and spectral power laws (Maloney et al., 2022), with the additional consideration of data quality and informativeness in the large-data, large-model regime. Once hyperparameters are fitted for a given domain, the law supports efficient prediction of optimal subset statistics and rapid sampler deployment for new domains. This suggests a broader utility in continual domain adaptation and principled data curation, with the law remaining stable across distinct application areas and model scales. The requirement of only a single forward pass for PPL estimation enhances scalability for practical deployment (Liu et al., 25 Dec 2025).
7. Relationship to Classical Scaling Laws and Theoretical Models
Classical scaling laws, modeled and analyzed in (Maloney et al., 2022), describe loss as a sum of power-law terms in dataset size and parameter count, with finite phase transitions marking bottleneck regimes. The incorporation of perplexity statistics in (Liu et al., 25 Dec 2025) operationalizes the spectral quality of the data distribution and enables finer-grained loss prediction. Theoretical implications include the alignment of the empirical sweet spot with latent cutoff-induced plateaus in loss scaling. This law may thereby provide a pathway for synthesizing empirical and theoretical analyses of data/model tradeoffs in large-scale foundation model adaptation.