Optimality of vanilla CLIP features for multimodal LLMs
Determine whether visual features derived from a standard CLIP-style contrastive vision encoder (i.e., vanilla CLIP features) constitute the best visual representation for multimodal large language models, or whether alternative or complementary encoders yield stronger representations.
References
This common practice raises an important open question: Is using vanilla CLIP features the best visual representation for multimodal LLMs?
— CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
(2604.03231 - Deria et al., 3 Apr 2026) in Section 1 (Introduction)