Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems

Published 11 Jul 2024 in cs.IR | (2407.08275v1)

Abstract: The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark performance scores only allows for a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is two-fold: We use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular Benchmark Information Retrieval (BEIR). Through our experiments we identify clusters of models corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity reveals high-variance at low k values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.

Summary

The paper applies Centered Kernel Alignment (CKA) to compare embedding models beyond traditional benchmark scores.
It combines this representational measure with retrieval-based measures, Jaccard and rank similarity, to assess how similar the models are.
The study reveals intra- and inter-family clusters and points to open-source models that behave similarly to proprietary ones.

Introduction

The paper "Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems" (2407.08275) aims to improve model selection for Retrieval Augmented Generation (RAG) systems by identifying similarities among embedding models that benchmark performance scores alone do not reveal. RAG systems, which use external knowledge bases to improve the factual accuracy and currency of LLMs, rely heavily on retrieving relevant text documents. Retrieval in turn depends on the embedding model, and many of the available models display similar benchmark performance yet may differ in what they actually retrieve.

Methodologies

Pair-wise Embedding Similarity

To assess model similarity beyond benchmark scores, the authors employ Centered Kernel Alignment (CKA), a representational similarity measure, which allows for direct comparison between different embedding models without requiring identical representation spaces.

Figure 1: Mean CKA similarity across all five datasets. Models tend to be most similar to models belonging to their own family, though some interesting inter-family patterns are visible as well.

CKA compares the pair-wise similarity structure that each model induces over the same set of texts, yielding a normalized score from zero (no similarity) to one (complete similarity).
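
The computation can be made concrete with a minimal sketch. It assumes a linear kernel, which is a common choice for CKA; this summary does not state which kernel the authors used, and the function name linear_cka is hypothetical. Both inputs must embed the same texts in the same row order, but the embedding dimensions may differ.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two embedding matrices.

    x has shape (n_texts, dim_x) and y has shape (n_texts, dim_y);
    row i of both matrices must embed the same text.
    Assumption: linear kernel (not confirmed by the source summary).
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Alignment between the two representations.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Self-alignment terms used for normalization.
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

With a linear kernel the score is unchanged by rotating or uniformly rescaling either embedding space, which is why two models with unrelated coordinate systems can still be compared.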

Retrieval Similarity

Besides representational similarity, the paper assesses the functional similarity relevant to RAG applications by evaluating the overlap in retrieved text chunks. Two measures are utilized:

Jaccard Similarity: Measures the overlap between the sets of text chunks retrieved by two models as the size of their intersection divided by the size of their union, showing how much retrieved content the models actually share.
Rank Similarity: Evaluates whether the chunks retrieved by both models appear at similar ranks, capturing retrieval alignment beyond mere overlap.

Figure 2: Rank similarity over all $k$ on NFCorpus, comparing gte-large to all other models. Scores are highest and vary most for small $k$, but then drop quickly before stabilizing for larger $k$.
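
The two retrieval measures can be sketched as follows. Jaccard similarity is the standard intersection-over-union. The rank measure shown here is only an illustrative stand-in, not the paper's formula, which this summary does not give: it averages 1 / (1 + rank difference) over the chunks both models retrieved. The function names and the chunk identifiers are hypothetical.

```python
def jaccard_similarity(retrieved_a: list, retrieved_b: list) -> float:
    """Intersection over union of two retrieved sets of chunk ids."""
    set_a, set_b = set(retrieved_a), set(retrieved_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty result lists
    return len(set_a & set_b) / len(union)


def rank_agreement(retrieved_a: list, retrieved_b: list) -> float:
    """Illustrative rank measure (an assumption, not the paper's definition).

    Each list is ordered from most to least relevant. For every chunk
    found in both lists, score 1 / (1 + |rank difference|), then average.
    """
    rank_a = {chunk: i for i, chunk in enumerate(retrieved_a, start=1)}
    rank_b = {chunk: i for i, chunk in enumerate(retrieved_b, start=1)}
    shared = rank_a.keys() & rank_b.keys()
    if not shared:
        return 0.0
    total = sum(1.0 / (1 + abs(rank_a[c] - rank_b[c])) for c in shared)
    return total / len(shared)


# Two models retrieve three chunks each; two chunks are shared.
model_a = ["c1", "c2", "c3"]
model_b = ["c2", "c3", "c4"]
print(jaccard_similarity(model_a, model_b))  # 2 shared / 4 distinct = 0.5
```

Note that the stand-in only looks at shared chunks, so it should be read together with the Jaccard score rather than on its own.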

Experimental Setup

The authors examine embeddings across five datasets from the BEIR benchmark, chosen based on size and representational properties, to ensure broad applicability of findings. Models evaluated include both open-source and proprietary options, covering diverse families such as e5, t5, bge, gte, OpenAI, and Cohere.

Results

Intra- and Inter-Family Clusters

The analysis reveals that models within the same family tend to be most similar to one another, which is particularly evident in CKA scores. Models from the bge and gte families also display noteworthy inter-family similarity, suggesting that they behave alike in practical applications. Some clusters exhibit almost perfect similarity, so a single representative from such a cluster could stand in for the rest during model selection.

Figure 3: Jaccard similarity over all $k$ on NFCorpus, comparing bge-large (a) and gte-large (b) to all other models. While bge-large shows high similarity to UAE-Large-v1 and mxbai-embed-large-v1, scores for gte-large are clustered much closer. Jaccard similarity seems to be most unstable for small $k$, which would commonly be chosen for retrieval tasks.

Discussion

The findings have practical implications for selecting embedding models in RAG systems. While pair-wise similarity offers insight into model representations, retrieval-based evaluation is more directly relevant to real-world applications, particularly because retrieval similarity varies most at the small $k$ values typically used in practice. The paper highlights the importance of considering both representational and retrieval similarity when choosing models, as this could greatly influence retrieval quality and system performance.
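
To see how retrieval similarity depends on $k$, the comparison can be repeated for several cutoffs. The sketch below is a simplified illustration, not the authors' pipeline: it assumes cosine similarity for ranking, uses random arrays in place of real model embeddings, and all function names are hypothetical.

```python
import numpy as np


def top_k_indices(query_emb: np.ndarray, chunk_emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k chunks most similar to each query.

    Assumption: chunks are ranked by cosine similarity.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = chunk_emb / np.linalg.norm(chunk_emb, axis=1, keepdims=True)
    scores = q @ c.T
    # Sorting the negated scores yields descending similarity.
    return np.argsort(-scores, axis=1)[:, :k]


def mean_jaccard_at_k(queries_a, chunks_a, queries_b, chunks_b, k: int) -> float:
    """Jaccard similarity of two models' top-k results, averaged over queries."""
    top_a = top_k_indices(queries_a, chunks_a, k)
    top_b = top_k_indices(queries_b, chunks_b, k)
    scores = []
    for row_a, row_b in zip(top_a, top_b):
        set_a, set_b = set(row_a.tolist()), set(row_b.tolist())
        scores.append(len(set_a & set_b) / len(set_a | set_b))
    return float(np.mean(scores))


if __name__ == "__main__":
    # Synthetic stand-ins for two models embedding the same 25 queries
    # and 200 chunks; real embeddings would replace these arrays.
    rng = np.random.default_rng(0)
    queries_a, chunks_a = rng.normal(size=(25, 32)), rng.normal(size=(200, 32))
    queries_b, chunks_b = rng.normal(size=(25, 64)), rng.normal(size=(200, 64))
    for k in (1, 5, 10, 50, 200):
        print(k, mean_jaccard_at_k(queries_a, chunks_a, queries_b, chunks_b, k))
```

When $k$ equals the number of chunks, every model retrieves everything and the score is trivially one, so differences between models are only informative at the smaller cutoffs that a RAG system would actually use.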

Figure 4: Jaccard (a) and rank similarity (b) for the top-10 retrieved text chunks averaged over 25 queries on NFCorpus. The clusters vary slightly depending on the measure, as do the scores. Models tend to be most similar to models from their own family. However, some inter-family clusters are visible as well.

The study also identifies open-source alternatives to proprietary models, which could provide cost-effective solutions with similar behavior. Among the open-source models, Mistral exhibits the highest similarity to OpenAI's embedding models, suggesting its potential as a substitute.

Conclusion

By examining model similarity through both representational and retrieval-based perspectives, the paper sheds light on critical aspects of embedding model selection for RAG systems. The implications of this research extend to enhancing the efficiency and cost-effectiveness of RAG deployments. Furthermore, the findings advocate for a nuanced approach to model selection that transcends performance benchmarks and considers practical retrieval outcomes.

Figure 5: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on SciFact (a) and ArguAna (b). The UAE and mxbai models show high levels of similarity along with bge-large. The remaining models tend to show the highest similarity within their own family with the exception of the bge/gte inter-family cluster.