Optimality of vanilla CLIP features for multimodal LLMs

Determine whether visual features derived from a standard CLIP-style contrastive vision encoder (i.e., vanilla CLIP features) are the best choice of visual representation for multimodal large language models.

Background

The paper notes that most multimodal LLMs use CLIP or CLIP-style encoders and often pass a single deep-layer representation into the LLM. While this setup is effective for global image–text alignment, it may underemphasize fine-grained cues such as localization and spatial relations, potentially limiting grounding performance.

Motivated by this concern, the authors introduce a complementary multi-encoder fusion of SigLIP2 and DINOv3 features and report empirical gains over single-encoder baselines. Nevertheless, the broader question of whether vanilla CLIP features provide the optimal visual representation for multimodal LLMs is explicitly posed as open.
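As a rough illustration only (not the paper's actual architecture), one common way to fuse features from two frozen vision encoders is to project each encoder's patch tokens into the LLM's embedding space and concatenate along the token axis. All dimensions below, and the choice of concatenation over alternatives such as cross-attention or interleaving, are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes; real SigLIP2 / DINOv3 dims vary by model size.
N_PATCHES = 196          # patch tokens per encoder (e.g., 14x14 grid)
D_SIGLIP, D_DINO = 768, 1024
D_LLM = 2048             # assumed LLM hidden size

# Stand-ins for frozen encoder outputs: (patches, feature_dim) per image.
siglip_feats = rng.standard_normal((N_PATCHES, D_SIGLIP))
dino_feats = rng.standard_normal((N_PATCHES, D_DINO))

# Per-encoder linear projectors into the LLM embedding space
# (randomly initialized here; in practice these would be trained).
W_siglip = rng.standard_normal((D_SIGLIP, D_LLM)) / np.sqrt(D_SIGLIP)
W_dino = rng.standard_normal((D_DINO, D_LLM)) / np.sqrt(D_DINO)

def fuse(siglip, dino):
    """Project each encoder's tokens to the LLM width, then concatenate
    along the token axis (one fusion option among several)."""
    return np.concatenate([siglip @ W_siglip, dino @ W_dino], axis=0)

visual_tokens = fuse(siglip_feats, dino_feats)
print(visual_tokens.shape)  # (392, 2048): doubled token count at LLM width
```

Token-axis concatenation doubles the visual sequence length handed to the LLM, which is the usual cost of keeping both encoders' fine-grained (DINO-style) and globally aligned (CLIP/SigLIP-style) cues without mixing them prematurely.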

References

This common practice raises an important open question: Is using vanilla CLIP features the best visual representation for multimodal LLMs?

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning  (2604.03231 - Deria et al., 3 Apr 2026) in Section 1 (Introduction)