Nearest Neighbor Normalization Improves Multimodal Retrieval
Abstract: Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present Nearest Neighbor Normalization (NNN), a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training. We show improved retrieval metrics in both text retrieval and image retrieval for all of the contrastive models we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets we used (MS-COCO and Flickr30k). NNN requires a reference database but no training on that database, and it can increase retrieval accuracy even after a model has been finetuned.
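The abstract describes the method only at a high level; below is a minimal NumPy sketch of the kind of training-free score correction NNN performs, in which each retrieval candidate's raw score is debiased by its mean similarity to its k nearest queries in the reference database. The function name `nnn_scores` and the hyperparameter values `k` and `alpha` here are illustrative, not the paper's exact settings.

```python
import numpy as np

def nnn_scores(query_emb, gallery_embs, reference_embs, k=16, alpha=0.75):
    """Correct raw contrastive retrieval scores with a per-candidate bias.

    The bias for each gallery item is the mean similarity between that
    item and its k most similar queries from a reference query database;
    a scaled copy of it is subtracted from the raw query-item score.
    No training is performed on the reference database.
    """
    # Raw similarity between the query and every gallery item
    # (embeddings are assumed L2-normalized, so this is cosine similarity).
    raw = gallery_embs @ query_emb                # (n_gallery,)

    # Per-item bias: mean similarity to the k nearest reference queries.
    ref_sims = gallery_embs @ reference_embs.T    # (n_gallery, n_ref)
    topk = np.sort(ref_sims, axis=1)[:, -k:]      # k largest per row
    bias = topk.mean(axis=1)                      # (n_gallery,)

    return raw - alpha * bias                     # debiased scores


# Toy usage with random unit vectors standing in for CLIP-style embeddings.
rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
gallery = unit(rng.normal(size=(1000, 512)))      # retrieval candidates
reference = unit(rng.normal(size=(5000, 512)))    # reference queries
query = unit(rng.normal(size=512))
ranking = np.argsort(-nnn_scores(query, gallery, reference))
```

Note that the bias term depends only on the gallery and the reference database, not on the incoming query, so it can be precomputed once per gallery item and reused across all queries.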