- The paper demonstrates that cosine similarity aligns with Pearson's correlation when embeddings have near-zero mean dimensions.
- It empirically shows that rank-based correlations like Spearman's ρ and Kendall's τ outperform cosine similarity for non-normally distributed embeddings.
- The findings suggest adopting rank correlations can improve semantic similarity measures and influence future NLP embedding training.
Correlation Coefficients and Semantic Textual Similarity
Introduction
The paper addresses a noteworthy gap in the study of semantic textual similarity (STS), focusing on how similarity is measured between embeddings rather than on the embeddings themselves. While robust embeddings, such as those from models like word2vec, GloVe, and fastText, have been widely explored, cosine similarity has typically been adopted as the default similarity measure with little scrutiny. The authors rigorously examine the validity of this default, drawing connections to Pearson's correlation coefficient and identifying when and why alternatives like rank-based correlations provide better performance.
Statistical Examination of Embedding Similarity
The authors establish that for common word vectors, cosine similarity effectively behaves like Pearson correlation. This equivalence stems from the observation that the mean of individual word vector dimensions across popular embeddings is approximately zero, thus aligning cosine similarity with Pearson correlation in a statistical sense.
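This equivalence is easy to check numerically. The sketch below (with synthetic vectors standing in for real word embeddings) verifies that Pearson's r is exactly the cosine similarity of the mean-centered vectors, so when each vector's components already average to roughly zero, plain cosine similarity approximates Pearson correlation:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic 300-dimensional stand-ins for word vectors (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pearson's r is exactly cosine similarity applied to mean-centered vectors.
r, _ = pearsonr(x, y)
assert np.isclose(r, cosine(x - x.mean(), y - y.mean()))

# When the components are already near-zero-mean, centering changes little,
# so plain cosine similarity is close to Pearson's r.
gap = abs(cosine(x, y) - r)
```

The identity in the assertion holds for any pair of vectors; the approximation in the last step is what depends on the near-zero-mean property the authors observe in popular embeddings.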
However, the paper highlights a critical nuance: the suitability of Pearson's correlation depends on the distribution of the vector components. Pearson's r is sensitive to outliers, which can distort similarity measurements. For embeddings whose components deviate markedly from normality, rank-based correlation coefficients such as Spearman's ρ and Kendall's τ offer a robust alternative, since they depend only on the ordering of the components rather than their magnitudes.
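The outlier sensitivity can be illustrated directly with SciPy (again on synthetic stand-in vectors, not the paper's embeddings): corrupting a single component of an otherwise strongly correlated pair sharply drags down Pearson's r, while the rank-based coefficients barely move.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x + 0.1 * rng.normal(size=300)  # strongly related "word vectors"

# Corrupt one dimension with a large outlier.
y_out = y.copy()
y_out[0] = 50.0

r_pearson = pearsonr(x, y_out)[0]    # dragged down by the single outlier
r_spearman = spearmanr(x, y_out)[0]  # rank-based: only one rank changes
r_kendall = kendalltau(x, y_out)[0]  # likewise robust
```

Only the outlier's rank, not its magnitude, enters Spearman's ρ and Kendall's τ, which is exactly why they remain stable here.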
Empirical Validation
The authors conducted extensive experiments on both word-level and sentence-level STS tasks using different embeddings including GloVe, fastText, and word2vec. The results corroborate their hypothesis: when embeddings deviate from normality, rank-based similarity measures outperform cosine similarity.
- Word-level Analysis: Experiments reveal that cosine similarity and Pearson correlation yield similar results. However, in cases where embeddings, such as those from GloVe, exhibit significant non-normality, rank correlations deliver improved performance.
- Sentence-level STS: The analysis shows a pronounced advantage for rank-based measures over cosine similarity when averaging word vectors from embeddings that are markedly non-normal, such as fastText's. This suggests that even for composite representations like averaged sentence vectors, rank correlations remain robust where traditional measures falter.
Implications and Future Directions
The findings provide a clear directive for choosing similarity measures: practitioners should prefer rank-based correlations in scenarios where embeddings deviate from normality. This choice can improve semantic evaluation performance across several NLP tasks. The paper also hints at a deeper question of how training objectives and methodologies shape the normality of the resulting vectors.
By bridging gaps in understanding with statistical rigor, this work contributes significantly toward better-founded similarity metrics for STS applications. Future work should explore which intrinsic properties of word embeddings produce their departures from normality, potentially leading to improvements in embedding training paradigms.
Conclusion
This paper presents a structured statistical perspective on why alternatives to cosine similarity, such as rank correlation coefficients, are sometimes more suited for STS tasks. The work argues convincingly for a shift in the default approach to measuring embedding similarity, with demonstrated empirical benefits. This contribution is crucial for advancing both theoretical and practical natural language processing frameworks.