Understanding the interaction between CPT and SFT
Characterize the interaction between continued pretraining (CPT) and supervised finetuning (SFT) for long-context vision-language models and determine why these procedures do not compose additively across many benchmarks.
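To make the target phenomenon concrete, here is a minimal sketch of the per-benchmark interaction term one could measure, assuming all four scores come from the same base checkpoint (the function and variable names are illustrative, not from the paper):

```python
def interaction_gap(base: float, cpt_only: float, sft_only: float,
                    cpt_then_sft: float) -> float:
    """Deviation from additive composition of CPT and SFT gains
    on one benchmark; all scores are measured from the same base model.

    Returns 0 when the stages compose additively; a negative value
    means the CPT-then-SFT pipeline recovers less than the sum of the
    individual stage gains (the non-additive regime the reference
    below describes).
    """
    gain_cpt = cpt_only - base           # gain from CPT alone
    gain_sft = sft_only - base           # gain from SFT alone
    gain_pipeline = cpt_then_sft - base  # gain from the full pipeline
    return gain_pipeline - (gain_cpt + gain_sft)
```

Characterizing the interaction then amounts to explaining where and why this gap is nonzero across benchmarks.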
References
The interaction between CPT and SFT remains incompletely understood: they do not compose additively across many benchmarks, suggesting opportunities for mixed-stage training or replay mechanisms.
— How to Train Your Long-Context Visual Document Model
(2602.15257, Veselka, 16 Feb 2026), in Conclusion: Limitations and future work
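One way to read the "replay mechanisms" mentioned in the quoted limitation is to keep a fraction of CPT data flowing during SFT. A minimal sketch under that assumption (all names are hypothetical; this is not the paper's training recipe):

```python
import random
from itertools import cycle
from typing import Iterable, Iterator, Tuple, TypeVar

Batch = TypeVar("Batch")

def sft_with_cpt_replay(
    sft_batches: Iterable[Batch],
    cpt_batches: Iterable[Batch],
    replay_ratio: float = 0.25,
    seed: int = 0,
) -> Iterator[Tuple[str, Batch]]:
    """Interleave replayed CPT batches into an SFT stream.

    Before each SFT batch, with probability `replay_ratio`, a batch is
    drawn from the (recycled) CPT stream, so the model keeps seeing
    long-context pretraining data during finetuning.
    """
    rng = random.Random(seed)
    cpt_iter = cycle(cpt_batches)  # recycle CPT data as a replay buffer
    for sft_batch in sft_batches:
        if rng.random() < replay_ratio:
            yield ("cpt", next(cpt_iter))
        yield ("sft", sft_batch)
```

A training loop would consume this as `for stage, batch in sft_with_cpt_replay(sft_loader, cpt_loader): ...`, switching the loss or packing logic on `stage`; sweeping `replay_ratio` is one simple probe of whether replay closes the non-additivity gap.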