Implicit query expansion effect of pre-training prompts in ColBERT

Determine whether the inclusion of the "search_query:" and "search_document:" prompt tokens during Nomic Embed contrastive pre-training acts as implicit query expansion in ColBERT late-interaction retrieval models, and quantify how much this mechanism contributes to retrieval performance relative to models trained without prompts or with only the [Q]/[D] marker tokens.

Background

The paper studies how the prompts used in Nomic Embed pre-training ("search_query:" and "search_document:") affect ColBERT-style retrievers. The authors observe that removing these prompts during full ColBERT pre-training yields a significant performance drop, suggesting the prompts provide a benefit beyond merely identifying the input type, possibly by interacting with the asymmetry between query and document encoding.

They note that early ColBERT variants benefited from query expansion mechanisms and hypothesize that prompts could serve as placeholder tokens storing global information. Modern implementations using Flash Attention no longer support PAD-based query expansion, making it unclear whether prompts induce a similar effect in current architectures.
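The query-expansion intuition can be made concrete with a toy MaxSim computation. The sketch below (numpy with random toy embeddings; `maxsim_score`, `toy_embeddings`, and all token counts are illustrative assumptions, not the paper's code) shows that ColBERT's late-interaction score is additive over query tokens, so any contextualized prompt tokens kept at scoring time contribute their own MaxSim terms, much like the padding-token expansion of early ColBERT.

```python
import numpy as np

def maxsim_score(Q, D):
    """ColBERT late interaction: each query token embedding takes its
    maximum similarity over all document token embeddings, and the
    per-token maxima are summed to give the query-document score."""
    sims = Q @ D.T                      # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)

def toy_embeddings(n_tokens, dim=8):
    # Random unit vectors standing in for contextualized token embeddings.
    x = rng.normal(size=(n_tokens, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

Q = toy_embeddings(4)    # bare query tokens (toy values)
P = toy_embeddings(2)    # stand-ins for prompt tokens such as "search_query:"
D = toy_embeddings(12)   # document tokens

# MaxSim is additive over query tokens, so contextualized prompt tokens
# simply add their own MaxSim terms to the score -- the hypothesized
# "implicit query expansion" effect.
base = maxsim_score(Q, D)
expanded = maxsim_score(np.vstack([Q, P]), D)
assert np.isclose(expanded, base + maxsim_score(P, D))
```

If the prompt tokens absorb global query context during contextualization, these extra terms can match document tokens that the bare query tokens miss, which is the behavior the research question asks to isolate and quantify.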

References

We conjecture this may be a form of implicit query expansion, a mechanism that proved very useful in early variants of ColBERT.

ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models  (2602.16609 - Chaffin et al., 18 Feb 2026) in Section 3.2 (Impact of the Prompt)