Pre-training fraction needed to prevent catastrophic forgetting during fine-tuning

Determine what fraction of pre-training data must be included in the fine-tuning (mid-training) mix to prevent catastrophic forgetting of pre-trained capabilities in Large Language Models while still enabling adaptation to specialized tasks.

Background

Catastrophic forgetting is a well-known issue when models are fine-tuned on specialized data; it is often mitigated by mixing pre-training data back into the fine-tuning data. Large-scale practice commonly includes substantial fractions of pre-training data in mid-training mixes to preserve base capabilities.
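The mixing strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, pool representation, and sampling scheme are assumptions. The key knob is `pretrain_frac`, the fraction of each batch drawn from the pre-training pool, which is exactly the quantity this question asks how to set.

```python
import random


def build_mixed_batch(pretrain_pool, finetune_pool, pretrain_frac, batch_size, seed=0):
    """Sample one training batch containing a fixed fraction of pre-training data.

    pretrain_frac: fraction of the batch replayed from pre-training data
                   (the unknown quantity discussed in the text).
    """
    rng = random.Random(seed)
    n_pretrain = round(pretrain_frac * batch_size)
    # Replay some pre-training examples to protect base capabilities ...
    batch = rng.choices(pretrain_pool, k=n_pretrain)
    # ... and fill the rest of the batch with specialized fine-tuning data.
    batch += rng.choices(finetune_pool, k=batch_size - n_pretrain)
    rng.shuffle(batch)
    return batch


pretrain = [("pretrain", i) for i in range(1000)]
finetune = [("finetune", i) for i in range(1000)]
batch = build_mixed_batch(pretrain, finetune, pretrain_frac=0.25, batch_size=8)
```

With `pretrain_frac=0.25` and `batch_size=8`, each batch carries two replayed pre-training examples; whether 25% (or far less) suffices is precisely the open question.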

Despite the widespread use of this practice, the precise fraction of pre-training data needed to effectively prevent forgetting has not been established. The authors introduce relative critical sharpness to analyze this trade-off and provide empirical guidance, while explicitly noting the uncertainty about the required proportion.

References

However, it remains unclear what fraction of pre-training data is sufficient to effectively prevent catastrophic forgetting.

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs (2601.16979 - Kalra et al., 23 Jan 2026), Section 4 ("How much Pre-training data is needed to avoid Catastrophic forgetting?"), first paragraph