Nearest Neighbor Normalization Improves Multimodal Retrieval
Abstract: Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present Nearest Neighbor Normalization (NNN), a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training. We show improved retrieval metrics in both text retrieval and image retrieval for all of the contrastive models we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets we used (MS-COCO and Flickr30k). NNN requires a reference database but no training on that database, and it can increase retrieval accuracy even after a model has been finetuned.
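The abstract describes the method only at a high level; below is a minimal NumPy sketch of the kind of training-free score correction NNN performs, in which each retrieval candidate's raw score is debiased by its mean similarity to its k nearest queries in the reference database. The function name `nnn_scores` and the hyperparameter values `k` and `alpha` here are illustrative, not the paper's exact settings.

```python
import numpy as np

def nnn_scores(query_emb, gallery_embs, reference_embs, k=16, alpha=0.75):
    """Correct raw contrastive retrieval scores with a per-candidate bias.

    The bias for each gallery item is the mean similarity between that
    item and its k most similar queries from a reference query database;
    a scaled copy of it is subtracted from the raw query-item score.
    No training is performed on the reference database.
    """
    # Raw similarity between the query and every gallery item
    # (embeddings are assumed L2-normalized, so this is cosine similarity).
    raw = gallery_embs @ query_emb                # (n_gallery,)

    # Per-item bias: mean similarity to the k nearest reference queries.
    ref_sims = gallery_embs @ reference_embs.T    # (n_gallery, n_ref)
    topk = np.sort(ref_sims, axis=1)[:, -k:]      # k largest per row
    bias = topk.mean(axis=1)                      # (n_gallery,)

    return raw - alpha * bias                     # debiased scores


# Toy usage with random unit vectors standing in for CLIP-style embeddings.
rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
gallery = unit(rng.normal(size=(1000, 512)))      # retrieval candidates
reference = unit(rng.normal(size=(5000, 512)))    # reference queries
query = unit(rng.normal(size=512))
ranking = np.argsort(-nnn_scores(query, gallery, reference))
```

Note that the bias term depends only on the gallery and the reference database, not on the incoming query, so it can be precomputed once per gallery item and reused across all queries.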