Designing a self-supervised objective for high-level language representations

Develop a self-supervised pretraining objective tailored specifically to high-level language representations, one that operates beyond token-level prediction and enables language models to learn and predict in an abstract representation space rather than directly over discrete tokens.
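
To make the contrast concrete, the sketch below places a standard next-token objective next to one hypothetical latent-space alternative: instead of a cross-entropy loss over the vocabulary, the model regresses the embedding of the upcoming text segment produced by a separate segment encoder. The module names (`segment_encoder`, `predictor`) and the cosine-distance loss are illustrative assumptions, not a method from the paper.

```python
import torch
import torch.nn.functional as F

# --- Standard token-level objective (what most LMs use) ---
def next_token_loss(logits, target_ids):
    # logits: (batch, seq_len, vocab_size); target_ids: (batch, seq_len)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1))

# --- Hypothetical latent-space objective (illustrative assumption) ---
def next_segment_latent_loss(context_states, next_segment_tokens,
                             segment_encoder, predictor):
    """Predict the representation of the next segment rather than its tokens.

    context_states:      (batch, d_hidden) summary of the preceding context
    next_segment_tokens: (batch, seg_len)  token ids of the upcoming segment
    segment_encoder:     callable mapping a token segment to a target vector
    predictor:           callable mapping the context summary to a prediction
    """
    with torch.no_grad():  # target representations are not trained in this sketch
        target = segment_encoder(next_segment_tokens)   # (batch, d_latent)
    pred = predictor(context_states)                     # (batch, d_latent)
    # Regression in the abstract representation space via cosine distance.
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```

The point of the contrast is that the supervision signal in the second loss lives entirely in a continuous representation space; how to define that space and its target distribution is exactly the open design question.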

Background

The paper surveys approaches that move beyond token-level modeling, such as patch-based compression and latent reasoning methods, but notes that these methods typically retain standard next-token or next-byte prediction objectives.

It highlights a gap in the field: although high-level abstractions are increasingly modeled, a principled self-supervised pretraining objective designed explicitly for high-level language representations has yet to be established. The authors propose Next Concept Prediction (NCP) as one attempt to address this gap by predicting discrete concepts in a quantized latent space, yet they frame the broader design of such objectives as an open challenge.
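
As a rough illustration of what an NCP-style objective could look like, the sketch below snaps segment representations to their nearest entry in a learned codebook ("concepts") and trains a head to predict the index of the next concept with cross-entropy. The codebook size, the nearest-neighbour quantization, and all identifiers are assumptions made for illustration; the paper's actual NCP formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextConceptObjective(nn.Module):
    """Illustrative next-concept-prediction-style loss (not the paper's exact method).

    Each segment representation is quantized to its nearest codebook entry
    ("concept"), and the model is trained to predict the next concept's index.
    """
    def __init__(self, d_latent=512, num_concepts=8192):
        super().__init__()
        self.codebook = nn.Embedding(num_concepts, d_latent)  # discrete concept space
        self.head = nn.Linear(d_latent, num_concepts)         # predicts next concept id

    def quantize(self, z):
        # z: (batch, seq, d_latent) -> nearest codebook index per segment
        flat = z.reshape(-1, z.size(-1))                   # (batch*seq, d_latent)
        dists = torch.cdist(flat, self.codebook.weight)    # (batch*seq, num_concepts)
        return dists.argmin(dim=-1).reshape(z.shape[:-1])  # (batch, seq)

    def forward(self, segment_reprs):
        # segment_reprs: (batch, seq, d_latent) contextual representations of text segments
        concept_ids = self.quantize(segment_reprs)          # discrete "concepts"
        logits = self.head(segment_reprs[:, :-1])           # predict from each position ...
        targets = concept_ids[:, 1:]                        # ... the id of the next concept
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
```

In practice the codebook itself would also need to be learned (for example with VQ-VAE-style commitment losses and straight-through estimators), which this sketch omits; choosing how to build and train that quantized concept space is part of the open problem.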

References

Despite these advancements, designing a self-supervised pretraining objective specifically for high-level language representations remains an open challenge.

Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models (2602.08984, Liu et al., 9 Feb 2026), Section 6.2 (Related Work: Abstract-level Modeling in Language Models)