Optimal ordering of the safety curriculum during pretraining

Determine the optimal temporal ordering for introducing safety pretraining interventions (specifically contextualized rephrasing, refusal training, and metadata-annotated examples enabling SafeBeam) within the pretraining token budget for large language models.

Background

The paper studies when to introduce safety interventions during pretraining, starting them at 0%, 20%, 60%, or 100% of a 600B-token budget while keeping the underlying data sources fixed. The interventions include contextualized rephrasing of harmful content, refusal training via request–refusal pairs, and metadata tagging that enables SafeBeam inference-time filtering.

The authors note a limitation: different introduction times lead to unequal exposure to synthetic contextualized-rephrasing and refusal data across models, meaning the exact number of tokens drawn from each mixture source is not controlled. Against this backdrop, they explicitly state that finding an optimal ordering for the curriculum remains an open question.
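The exposure imbalance follows directly from the schedule: the later the start, the fewer tokens remain in the budget for the safety data to occupy. A minimal sketch of that arithmetic, assuming a hypothetical fixed 10% safety-mixture fraction after the start point (the actual mixture rate is not given in this excerpt; the function name is illustrative):

```python
# Illustrative sketch, not the paper's code. Shows why later introduction
# times mechanically reduce total safety-data exposure.

TOTAL_BUDGET = 600e9  # 600B-token pretraining budget (from the paper)

def safety_token_exposure(start_frac: float, safety_mix_frac: float = 0.10) -> float:
    """Tokens of safety data seen if the intervention starts at `start_frac`
    of the budget and then fills `safety_mix_frac` of every subsequent batch.
    The 10% default mixture fraction is a made-up assumption."""
    remaining_tokens = TOTAL_BUDGET * (1.0 - start_frac)
    return remaining_tokens * safety_mix_frac

# The four start times studied in the paper yield very different exposures:
for start in (0.0, 0.2, 0.6, 1.0):
    exposure = safety_token_exposure(start)
    print(f"start at {start:.0%}: {exposure / 1e9:.0f}B safety tokens")
```

Under this (assumed) constant mixing rate, a 0% start sees 60B safety tokens while a 60% start sees only 24B, so any comparison of start times also compares total exposure, which is exactly the confound the authors flag.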

References

Finding an optimal 'ordering' for a curriculum remains an interesting open question.

When Should We Introduce Safety Interventions During Pretraining?  (2601.07087 - Sam et al., 11 Jan 2026) in Additional Discussion, Appendix