Distinctiveness of NSP supervision beyond NTP
Establish that the next-sequence prediction objective provides sequence-level learning signals not captured by the next-token prediction objective when training fast weight language models, thereby explaining observed improvements in next-token prediction accuracy under mid-training.
References
We conjecture that the NSP objective's sequence-level supervision provides learning signals that NTP does not.
— Reinforced Fast Weights with Next-Sequence Prediction
(2602.16704 - Hwang et al., 18 Feb 2026) in Section 4.2 (Impact of ReFINE on Mid-Training)