
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Published 23 May 2025 in cs.LG, cs.AI, and cs.CL | (arXiv:2505.18091v1)

Abstract: LLMs are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.

Summary

  • The paper demonstrates that when knowledge-dense data is mixed with web scrapes, knowledge acquisition can improve abruptly, phase-transition style, rather than following a smooth scaling law.
  • Beyond a critical mixing ratio or model-size threshold, LLMs rapidly shift from memorizing almost none to most of the knowledge-dense dataset.
  • The study validates a power-law relationship between the critical threshold and model size, and proposes mitigation strategies such as random subsampling and compact knowledge mixing (CKM) to boost knowledge acquisition under low mixing ratios.

LLMs are typically trained using data mixtures composed of large-scale web scrapes and smaller, highly curated datasets containing dense domain-specific knowledge. The paper "Data Mixing Can Induce Phase Transitions in Knowledge Acquisition" investigates how these mixtures affect knowledge acquisition and reveals that, unlike when training exclusively on knowledge-dense data, knowledge acquisition from such mixtures can exhibit phase transitions with respect to the mixing ratio and model size.

Phase Transitions in Knowledge Acquisition

Phase Transition in Model Size

The study shows that as the model size increases beyond a certain threshold, the model rapidly transitions from memorizing very few to most of the biographies in a synthetic dataset mixed with web data. Below the threshold, the model memorizes almost nothing, implying a sudden change in knowledge acquisition capability once this threshold is surpassed.

Figure 1: Phase transition in model size. For each mixing ratio, as model size increases, accuracy initially remains zero; once model size surpasses a threshold, accuracy rapidly grows to over 60%.
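
The controlled experiments use a synthetic biography (SynBio) dataset, where each profile is a bundle of independent facts. As a rough illustration of how such data can be constructed (the attribute pools and phrasing template below are hypothetical, not the paper's exact recipe):

```python
import random

# Hypothetical attribute pools; the paper's SynBio dataset defines its
# own fields and templates.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Erin"]
LAST_NAMES = ["Nguyen", "Okafor", "Silva", "Kim", "Haddad"]
CITIES = ["Lisbon", "Osaka", "Denver", "Nairobi", "Tallinn"]
JOBS = ["architect", "chemist", "violinist", "surveyor", "editor"]

def make_biography(rng: random.Random) -> dict:
    """Sample one synthetic profile: a small bundle of independent facts."""
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "birth_year": rng.randint(1940, 2005),
        "city": rng.choice(CITIES),
        "job": rng.choice(JOBS),
    }

def render(bio: dict) -> str:
    """Turn a profile into natural-language training text."""
    return (f"{bio['name']} was born in {bio['birth_year']}. "
            f"{bio['name']} lives in {bio['city']} and works as a {bio['job']}.")

rng = random.Random(0)
for bio in (make_biography(rng) for _ in range(3)):
    print(render(bio))
```

Memorization can then be probed by querying each fact (e.g. "Where does Alice Nguyen live?") and scoring exact-match accuracy, which is the kind of accuracy plotted in Figure 1.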

Phase Transition in Mixing Ratio

Similarly, with the model size fixed, as the mixing ratio increases beyond a critical value, the model's ability to memorize the dataset improves dramatically. Below this critical mixing ratio, extensive training does not help: models memorize almost nothing regardless of the number of training iterations.

Figure 2: Phase transition in mixing ratio persists for larger models. We train Pythia-2.8B and 6.9B with 2B and 1B total training tokens, respectively. Smaller SynBio datasets mixed with FineWeb-Edu were used to ensure adequate exposure.
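
In practice, a mixing ratio can be realized by drawing each training example from the knowledge-dense corpus with probability equal to the ratio, and from the web corpus otherwise. A minimal sketch, with placeholder corpora standing in for SynBio and FineWeb-Edu:

```python
import random
from typing import Iterator, Sequence

def mixed_stream(dense: Sequence[str], web: Sequence[str],
                 mixing_ratio: float, seed: int = 0) -> Iterator[str]:
    """Yield training examples, drawing from the knowledge-dense corpus
    with probability `mixing_ratio` and from web data otherwise
    (i.i.d. sampling with replacement)."""
    rng = random.Random(seed)
    while True:
        source = dense if rng.random() < mixing_ratio else web
        yield rng.choice(source)

# Placeholder corpora standing in for SynBio and FineWeb-Edu.
dense_corpus = ["<biography 1>", "<biography 2>"]
web_corpus = ["<web doc 1>", "<web doc 2>", "<web doc 3>"]

stream = mixed_stream(dense_corpus, web_corpus, mixing_ratio=0.1)
print([next(stream) for _ in range(8)])
```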

Theoretical Insights

The paper attributes these phase transitions to a capacity allocation phenomenon that resembles solving a knapsack problem. A model with bounded capacity must allocate its resources across the datasets in the mixture to minimize overall test loss, and the optimal allocation can change discontinuously as model size or mixing ratio varies. Formalized in an information-theoretic framework, these transitions follow a predictable pattern, with the critical mixing ratio obeying a power-law relationship with model size.

Figure 3: An illustration of the intuition behind our theory.
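
A toy numerical illustration of the knapsack intuition (not the paper's formal model): give web data smooth, diminishing returns per unit of capacity, and give the dense dataset a fixed overhead before linear returns kick in. The optimal split then jumps discontinuously as total capacity grows.

```python
import math

def web_gain(bits: float, weight: float) -> float:
    """Smooth diminishing returns: each extra unit of capacity spent
    on web data reduces web loss by less and less."""
    return weight * math.log1p(bits)

def dense_gain(bits: float, weight: float, overhead: float = 20.0) -> float:
    """All-or-nothing flavor: a fixed overhead must be paid before
    memorized facts start reducing loss; afterwards returns are linear."""
    return weight * max(0.0, bits - overhead)

def best_allocation(capacity: int, mixing_ratio: float) -> tuple[int, int]:
    """Brute-force knapsack-style split of capacity across the two
    datasets, maximizing the mixture-weighted total loss reduction."""
    b_dense = max(range(capacity + 1),
                  key=lambda b: dense_gain(b, mixing_ratio)
                              + web_gain(capacity - b, 1 - mixing_ratio))
    return b_dense, capacity - b_dense

for capacity in (40, 60, 80, 120, 160):
    dense_bits, web_bits = best_allocation(capacity, mixing_ratio=0.1)
    print(f"capacity={capacity:4d}  dense={dense_bits:4d}  web={web_bits:4d}")
```

Under this toy objective, the dense allocation sits at zero for small capacities and then jumps to most of the budget once capacity crosses a threshold, mirroring the discontinuous transitions observed empirically.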

Experimental Validation and Strategies

Power-Law Relationship in Threshold Frequency

Experiments validate a power-law relationship: the threshold frequency, above which a fact is reliably learned, decreases as model size increases, with a power-law exponent roughly equal to the model-scaling exponent plus one.

Figure 4: Validating the power-law relationship between threshold frequency and model size.
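
A power law appears as a straight line in log-log coordinates, so the fitted exponent can be recovered by linear regression on the logs. A minimal sketch with made-up data points (real values would come from the paper's sweep over model sizes):

```python
import numpy as np

# Hypothetical (model size, threshold frequency) pairs for illustration.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])        # parameters
threshold_freq = np.array([2.0e-3, 8.1e-4, 3.0e-4,
                           1.2e-4, 4.5e-5])               # per-token frequency

# A power law f* ~ N^(-alpha) is linear in log-log space:
# log f* = -alpha * log N + c.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(threshold_freq), 1)
print(f"estimated exponent alpha = {-slope:.2f}")
```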

Mitigation Strategies

Two strategies are proposed to boost knowledge acquisition, especially under low mixing ratios:

  1. Random Subsampling: Randomly subsample the knowledge-dense dataset so that each remaining fact is seen more often within the same token budget (sketched below).
  2. Compact Knowledge Mixing (CKM): Rephrase the knowledge into compact forms and use them to augment the original dataset.
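
For the first strategy, a minimal sketch of how random subsampling trades coverage for exposure frequency (the `keep_fraction` knob and corpus are hypothetical):

```python
import random

def subsample_dense(dense: list[str], keep_fraction: float,
                    seed: int = 0) -> list[str]:
    """Keep a random subset of the knowledge-dense dataset. With the
    mixing ratio (share of training tokens drawn from the dense source)
    held fixed, each surviving fact is exposed roughly
    1 / keep_fraction times more often."""
    rng = random.Random(seed)
    k = max(1, int(len(dense) * keep_fraction))
    return rng.sample(dense, k)

biographies = [f"<biography {i}>" for i in range(1000)]
kept = subsample_dense(biographies, keep_fraction=0.25)
print(f"{len(kept)} biographies kept; each is now seen ~4x more often")
```

The trade-off is coverage: facts dropped by subsampling are never learned, so the kept fraction presumably needs to push per-fact exposure above the threshold frequency for the given model size.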

These methods aim to increase the "marginal value" of the dense dataset within the mixture, resulting in significant improvements in knowledge recall without compromising general model performance.

Figure 5: Our proposed strategies significantly boost knowledge acquisition under low mixing ratios while preserving models' general capability.

Conclusion

The paper "Data Mixing Can Induce Phase Transitions in Knowledge Acquisition" provides valuable insights into the training dynamics of LLMs when using mixed datasets. Phase transitions observed in model size and mixing ratios challenge the conventional understanding that larger models necessarily acquire more knowledge linearly. These findings highlight the importance of carefully configuring data mixtures to optimize LLM training and suggest that strategies like subsampling and compact knowledge mixing may offer substantial benefits. Future AI applications involving LLMs could leverage these insights to refine training methodologies, thereby enhancing model performance across various tasks.
