Derive scaling laws for alignment pretraining

Determine the precise scaling behavior of alignment-pretraining interventions as a function of model size, data quantity, and training compute, including whether small, fixed data mixtures can reliably influence alignment priors at scale and how their effects interact with increased post-training FLOPs.

Background

The experiments are conducted on 6.9B-parameter models due to resource constraints. Prior evidence suggests that pretraining priors may exert stronger effects at larger scales, but the exact scaling of safety-focused pretraining interventions has not been characterized.

Formal scaling laws would guide practitioners on how much alignment-targeted data and compute are required to achieve specified alignment outcomes across model scales.
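As a concrete illustration of what such a law might look like, the sketch below fits a simple power-law form, metric(N) = a * N^(-alpha), to alignment-metric measurements across model sizes via log-log least squares. The functional form, the model sizes, and all metric values are hypothetical placeholders, not results from the cited work; a real study would fit jointly over parameters, data, and compute.

```python
import math

# Hypothetical placeholder data: (model parameter count, misalignment score).
# Values are illustrative only; they are not measurements from any paper.
N = [1.4e8, 4.1e8, 1.4e9, 2.8e9, 6.9e9]
score = [0.42, 0.31, 0.22, 0.18, 0.15]

# Assumed functional form: score(N) = a * N^(-alpha).
# Taking logs gives a linear relation: ln(score) = ln(a) - alpha * ln(N),
# which we fit with ordinary least squares.
x = [math.log(n) for n in N]
y = [math.log(s) for s in score]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

alpha = -slope          # scaling exponent
a = math.exp(intercept) # prefactor
print(f"fitted power law: score(N) ~ {a:.3g} * N^(-{alpha:.3f})")
```

With a fitted exponent in hand, one could extrapolate how much alignment-targeted data or compute a larger model would need to reach a specified score, which is exactly the guidance a formal scaling law would provide.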

References

Although evidence suggests that the effect of pretraining priors increases with model size and data quantity, the precise scaling behavior of safety interventions at pretraining remains unknown.

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment  (2601.10160 - Tice et al., 15 Jan 2026) in Section 7, Future Work – Scaling Laws for Alignment Pretraining