Understanding the interaction between CPT and SFT

Characterize the interaction between continued pretraining (CPT) and supervised finetuning (SFT) for long-context vision-language models and determine why these procedures do not compose additively across many benchmarks.

Background

The study finds that CPT can extend the context length and improve long-context text performance, while SFT and preference optimization provide strong gains on long-document visual question answering. Empirically, however, CPT and SFT often fail to compose additively, raising the question of how these training phases interact.

The authors highlight this as an unresolved issue and suggest that mixed-stage training or replay mechanisms might help, pointing to a need to investigate when and why CPT and SFT interfere or fail to compound gains across evaluation suites.
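
The paper only names mixed-stage training and replay as candidate remedies; it does not specify a recipe. The sketch below shows one minimal way a replay mechanism could be realized, by interleaving held-out CPT-style long-context samples into the SFT data stream so that finetuning does not overwrite the long-context behavior learned during CPT. The function name, the replay_ratio value, and the toy datasets are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of a replay-style data mixture for the SFT stage.
# Assumption: a fixed fraction of each step's data is drawn from held-out
# CPT long-context data; the ratio and dataset handles are hypothetical.
import random
from typing import Iterator, Sequence


def mixed_stage_stream(
    sft_examples: Sequence[dict],
    cpt_examples: Sequence[dict],
    replay_ratio: float = 0.2,  # assumed fraction of CPT data replayed during SFT
    seed: int = 0,
) -> Iterator[dict]:
    """Yield an infinite stream that interleaves SFT instruction data with
    replayed CPT long-context data."""
    rng = random.Random(seed)
    while True:
        if rng.random() < replay_ratio:
            yield rng.choice(cpt_examples)  # replayed long-context pretraining sample
        else:
            yield rng.choice(sft_examples)  # standard supervised finetuning sample


if __name__ == "__main__":
    # Toy placeholders standing in for long-document VQA and long-context corpora.
    sft = [{"source": "sft", "id": i} for i in range(5)]
    cpt = [{"source": "cpt", "id": i} for i in range(5)]
    stream = mixed_stage_stream(sft, cpt, replay_ratio=0.2)
    print([next(stream)["source"] for _ in range(10)])
```

Sweeping replay_ratio (or annealing it over the course of SFT) would be one concrete way to measure when the two stages interfere rather than compound.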

References

The interaction between CPT and SFT remains incompletely understood: they do not compose additively across many benchmarks, suggesting opportunities for mixed-stage training or replay mechanisms.

How to Train Your Long-Context Visual Document Model (2602.15257 - Veselka, 16 Feb 2026) in Conclusion — Limitations and future work