- The paper demonstrates that cosine similarity aligns with Pearson's correlation when embeddings have near-zero mean dimensions.
- It empirically shows that rank-based correlations like Spearman's ρ and Kendall's τ outperform cosine similarity for non-normally distributed embeddings.
- The findings suggest adopting rank correlations can improve semantic similarity measures and influence future NLP embedding training.
Correlation Coefficients and Semantic Textual Similarity
Introduction
The paper addresses a noteworthy gap in the study of semantic textual similarity (STS), focusing on how similarity is measured between embeddings rather than on the embeddings themselves. While robust embeddings, such as those from models like word2vec, GloVe, and fastText, have been widely explored, cosine similarity has typically been adopted as the default similarity measure with little scrutiny. The authors rigorously examine the validity of this default, drawing connections to Pearson's correlation coefficient and identifying when and why alternatives like rank-based correlations provide better performance.
Statistical Examination of Embedding Similarity
The authors establish that for common word vectors, cosine similarity effectively behaves like Pearson correlation. This equivalence stems from the observation that the mean of individual word vector dimensions across popular embeddings is approximately zero, thus aligning cosine similarity with Pearson correlation in a statistical sense.
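This equivalence is easy to check numerically. The sketch below (with synthetic vectors standing in for real word embeddings) verifies that Pearson's r is exactly the cosine similarity of the mean-centered vectors, so when each vector's components already average to roughly zero, plain cosine similarity approximates Pearson correlation:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic 300-dimensional stand-ins for word vectors (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pearson's r is exactly cosine similarity applied to mean-centered vectors.
r, _ = pearsonr(x, y)
assert np.isclose(r, cosine(x - x.mean(), y - y.mean()))

# When the components are already near-zero-mean, centering changes little,
# so plain cosine similarity is close to Pearson's r.
gap = abs(cosine(x, y) - r)
```

The identity in the assertion holds for any pair of vectors; the approximation in the last step is what depends on the near-zero-mean property the authors observe in popular embeddings.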
However, the paper highlights a critical nuance: the suitability of Pearson's correlation depends on the distribution of the vector components. Pearson's r is sensitive to outliers, which can distort similarity measurements. For embeddings whose components deviate markedly from normality, rank-based correlation coefficients such as Spearman's ρ and Kendall's τ offer a robust alternative, since they depend only on the ordering of the components rather than their magnitudes.
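The outlier sensitivity can be illustrated directly with SciPy (again on synthetic stand-in vectors, not the paper's embeddings): corrupting a single component of an otherwise strongly correlated pair sharply drags down Pearson's r, while the rank-based coefficients barely move.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x + 0.1 * rng.normal(size=300)  # strongly related "word vectors"

# Corrupt one dimension with a large outlier.
y_out = y.copy()
y_out[0] = 50.0

r_pearson = pearsonr(x, y_out)[0]    # dragged down by the single outlier
r_spearman = spearmanr(x, y_out)[0]  # rank-based: only one rank changes
r_kendall = kendalltau(x, y_out)[0]  # likewise robust
```

Only the outlier's rank, not its magnitude, enters Spearman's ρ and Kendall's τ, which is exactly why they remain stable here.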
Empirical Validation
The authors conducted extensive experiments on both word-level and sentence-level STS tasks using different embeddings including GloVe, fastText, and word2vec. The results corroborate their hypothesis: when embeddings deviate from normality, rank-based similarity measures outperform cosine similarity.
- Word-level Analysis: Experiments reveal that cosine similarity and Pearson correlation yield similar results. However, in cases where embeddings, such as those from GloVe, exhibit significant non-normality, rank correlations deliver improved performance.
- Sentence-level STS: The analysis shows a pronounced advantage for rank-based measures over cosine similarity when averaging word vectors from embeddings that are markedly non-normal, such as fastText's. This suggests that even for composite representations like averaged sentence vectors, rank correlations remain robust where traditional measures falter.
Implications and Future Directions
The findings provide a clear directive for choosing similarity measures: practitioners should prefer rank-based correlations in scenarios where embeddings deviate from normality. This choice can improve semantic evaluation performance across several NLP tasks. The paper also hints at a deeper question of how training objectives and methodologies shape the normality of the resulting vectors.
By bridging gaps in understanding with statistical rigor, this work contributes significantly toward better-founded similarity metrics for STS applications. Future work should explore which intrinsic properties of word embeddings produce their departures from normality, potentially leading to improvements in embedding training paradigms.
Conclusion
This paper presents a structured statistical perspective on why alternatives to cosine similarity, such as rank correlation coefficients, are sometimes more suited for STS tasks. The work argues convincingly for a shift in the default approach to measuring embedding similarity, with demonstrated empirical benefits. This contribution is crucial for advancing both theoretical and practical natural language processing frameworks.