Explain why alignment loss outperforms ranking-based objectives

Establish whether the observed superiority of pointwise cosine alignment over ranking-based distillation losses, seen when training NanoVDR text-only students distilled from the Qwen3-VL-Embedding-2B teacher, arises from the high quality of the teacher’s embedding space. Specifically, test the hypothesis that well-structured teacher coordinates enable direct spatial alignment to capture richer geometric information than relative ranking losses.

Background

NanoVDR trains a text-only student to map queries into a frozen vision-language teacher’s embedding space. The authors systematically compared six distillation objectives across three backbones and 22 ViDoRe datasets. They observed that pointwise cosine alignment, which directly matches student and teacher query embeddings, consistently outperforms ranking-based and contrastive alternatives.

They explicitly conjecture that alignment’s advantage is due to the teacher’s high-quality, well-structured embedding space, which makes direct spatial alignment more informative than relative ranking. Validating this causal explanation would clarify when alignment-based distillation should be preferred and how teacher embedding quality mediates distillation success.
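
To make the contrast between the two kinds of objective concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a pointwise loss of the form 1 - cos(student, teacher) and, as a stand-in for the ranking-based family, a KL divergence between softmax-normalized query-document scores. The function names, the temperature value, and the three-dimensional vectors are all hypothetical choices made for illustration. The toy case shows one way a ranking loss can be blind to geometry that alignment penalizes: if the candidate documents do not span a direction, a student can differ from the teacher along that direction and still reproduce every score.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine_alignment_loss(student_q, teacher_q):
    """Pointwise loss: mean of 1 - cos(student, teacher) over queries."""
    losses = [1.0 - dot(normalize(s), normalize(t))
              for s, t in zip(student_q, teacher_q)]
    return sum(losses) / len(losses)

def softmax(scores, temperature):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def ranking_kl_loss(student_q, teacher_q, docs, temperature=0.05):
    """Assumed ranking-style loss: mean KL(teacher || student) over the
    softmax of query-document cosine scores. Only relative scores matter."""
    docs_n = [normalize(d) for d in docs]
    total = 0.0
    for s, t in zip(student_q, teacher_q):
        p = softmax([dot(normalize(t), d) for d in docs_n], temperature)
        q = softmax([dot(normalize(s), d) for d in docs_n], temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_q)

# Toy data (hypothetical): documents lie in the x-y plane, while the
# student and teacher queries differ only along the z axis.
teacher = [[1.0, 0.0, 1.0]]
student = [[1.0, 0.0, -1.0]]
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

align = cosine_alignment_loss(student, teacher)
rank = ranking_kl_loss(student, teacher, docs)
print(f"alignment loss: {align:.3f}, ranking loss: {rank:.3f}")
```

In this constructed case the ranking loss is zero although the student and teacher query vectors are orthogonal, whereas the alignment loss is at its orthogonal value of 1. This illustrates the mechanism the conjecture appeals to; it is not evidence that the mechanism explains the reported results.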

References

We conjecture that alignment's advantage stems from the high quality of our teacher's embedding space: when the teacher provides well-structured coordinates, direct spatial alignment exploits richer geometric signal than relative ranking alone.

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval (2603.12824 - Liu et al., 13 Mar 2026) in Section 6.1 (The Monotonic Superiority of Spatial Alignment)