
Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration

Published 23 May 2023 in cs.CL (arXiv:2305.14324v2)

Abstract: Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations. Its focus on pairwise score comparisons is intuitive but raises the question of how ties should be handled, a gray area that has motivated different variants in the literature. We demonstrate that, in settings like modern MT meta-evaluation, existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed. We propose instead to meta-evaluate metrics with a version of pairwise accuracy that gives metrics credit for correctly predicting ties, in combination with a tie calibration procedure that automatically introduces ties into metric scores, enabling fair comparison between metrics that do and do not predict ties. We argue and provide experimental evidence that these modifications lead to fairer ranking-based assessments of metric performance.


Summary

  • The paper introduces a novel pairwise accuracy framework with tie calibration to overcome limitations of Kendall’s tau in MT metric evaluation.
  • It employs an epsilon threshold to standardize tie predictions, ensuring a fair and robust comparison across metrics.
  • Results on WMT'22 data demonstrate that tie calibration reshuffles metric rankings, underscoring the importance of accounting for ties.

Overview of "Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration"

The paper "Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration" presents a novel approach to address the limitations of Kendall's τ in the context of machine translation (MT) metric evaluation. By focusing on pairwise accuracy and introducing a tie calibration procedure, the authors propose a framework that enables a more accurate and fair comparison between metrics that predict ties and those that do not.

Motivation and Background

Kendall's τ, a statistic widely used in the meta-evaluation of MT metrics, has significant weaknesses stemming from its handling of ties. As MT systems and metrics have advanced, particularly with LLM-based and MQM-modeling metrics, ties have become more prevalent. This motivates a re-evaluation of how metrics are assessed, with particular focus on their ability to predict ties accurately.

Kendall's τ falls short in scenarios where ties are significant, such as when multiple translations are perfect or when score differences between translations are negligible. These scenarios become increasingly common as MT systems improve, creating the need for a meta-evaluation methodology that accounts for ties more robustly.

Proposed Approach

The authors propose replacing Kendall's τ with pairwise accuracy, which gives metrics credit for correctly predicting both rankings and ties. This framing simplifies the interpretation of metric performance: it is the proportion of pairs ranked correctly, including correct tie predictions.
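
The pairwise-accuracy idea can be sketched as follows (a minimal reading of the definition, not the authors' reference implementation; the `eps` tie threshold anticipates the calibration procedure described in the next section):

```python
import itertools

def pairwise_accuracy(human, metric, eps=0.0):
    """Fraction of item pairs whose human ranking -- including ties --
    the metric reproduces. Metric scores closer than eps count as tied."""
    correct = total = 0
    for (h1, m1), (h2, m2) in itertools.combinations(zip(human, metric), 2):
        human_sign = (h1 > h2) - (h1 < h2)
        diff = m1 - m2
        metric_sign = 0 if abs(diff) <= eps else (1 if diff > 0 else -1)
        correct += human_sign == metric_sign
        total += 1
    return correct / total
```

For example, with human scores `[1, 1, 2]` and metric scores `[0.5, 0.6, 0.9]`, the metric only gets credit for the human tie when `eps` is large enough for its first two scores to count as tied.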

Tie Calibration

A pivotal contribution is the tie calibration algorithm, which optimizes a metric's correlation by introducing an appropriate number of ties into its scores. It searches for an ε value such that pairs of translations whose scores differ by less than ε are declared tied, thereby putting metrics that rarely predict ties on equal footing with those that do.

The algorithm runs in O(n² log n) time and guarantees that all metrics are evaluated on the same basis for comparison, overcoming biases inherent in metric designs.
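
A naive version of this search can be sketched as below (hypothetical helper names, not the paper's code; the sweep re-scores every pair for every candidate ε, so it is much slower than the paper's O(n² log n) incremental formulation, but it illustrates the idea):

```python
import itertools

def pairwise_accuracy(human, metric, eps):
    """Fraction of pairs whose human ranking (including ties) the metric
    reproduces, treating metric scores closer than eps as tied."""
    pairs = list(itertools.combinations(zip(human, metric), 2))
    correct = sum(
        ((h1 > h2) - (h1 < h2))
        == (0 if abs(m1 - m2) <= eps else (1 if m1 > m2 else -1))
        for (h1, m1), (h2, m2) in pairs
    )
    return correct / len(pairs)

def tie_calibration(human, metric):
    """Sweep candidate epsilons (the observed absolute score gaps, plus 0)
    and return the value maximizing pairwise accuracy."""
    gaps = {abs(m1 - m2) for m1, m2 in itertools.combinations(metric, 2)}
    best_eps, best_acc = 0.0, pairwise_accuracy(human, metric, 0.0)
    for eps in sorted(gaps):
        acc = pairwise_accuracy(human, metric, eps)
        if acc > best_acc:
            best_eps, best_acc = eps, acc
    return best_eps, best_acc
```

Only the observed score gaps need to be tried as candidates, since accuracy can only change when ε crosses one of them; this is also what bounds the paper's exact search.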

Implementation and Results

The paper demonstrates that traditional variants of Kendall's τ are biased: they either favor metrics that predict many ties or penalize them. Pairwise accuracy, enhanced with tie calibration, offers a balanced alternative, and these adjustments significantly influence metric rankings. When applied to WMT'22 data, the method shows how metrics that rank highest under Kendall's τ_b can rank differently when assessed with pairwise accuracy.

Figure 1: Pearson's r, Spearman's rho, and Kendall's tau_b calculated between hypothetical human and metric scores.
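
The tie-related behavior of τ_b can be reproduced with simple pair counting. The toy scores below are hypothetical (not taken from the paper): a metric that reproduces a human tie, one that arbitrarily breaks it, and a degenerate metric that predicts only ties are treated very differently, the last yielding an undefined correlation:

```python
import itertools, math

def tau_b(x, y):
    """Kendall's tau-b via exhaustive pair counting."""
    C = D = tx = ty = n0 = 0
    for (x1, y1), (x2, y2) in itertools.combinations(zip(x, y), 2):
        n0 += 1
        sx = (x1 > x2) - (x1 < x2)
        sy = (y1 > y2) - (y1 < y2)
        tx += sx == 0          # pairs tied in x
        ty += sy == 0          # pairs tied in y
        if sx and sy:
            C += sx == sy      # concordant pairs
            D += sx != sy      # discordant pairs
    denom = math.sqrt((n0 - tx) * (n0 - ty))
    return (C - D) / denom if denom else float("nan")

human = [1, 1, 2, 3]
print(tau_b(human, [0.1, 0.1, 0.2, 0.3]))   # reproduces the tie: 1.0
print(tau_b(human, [0.1, 0.15, 0.2, 0.3]))  # breaks the tie: ~0.913
print(tau_b(human, [0.5, 0.5, 0.5, 0.5]))   # all ties: nan (denominator 0)
```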

The results underscore that failing to account for ties can lead to misleading conclusions about a metric's efficacy. By standardizing the evaluation procedure, the authors make the case for more nuanced assessment methods that better reflect metric performance across translation qualities.

Figure 2: Dividing the Metric-X scores into equal width buckets illustrates irregular comparisons arising from NaN scores in correlations.

Discussion and Implications

The proposed pairwise accuracy with tie calibration offers a methodological advancement that allows fairer, more representative assessments of MT metrics. It aligns well with scenarios where both exact ranking and tie prediction are crucial to understanding translation quality.

Figure 3: The generalization of the selected ε̂ (dashed line) across datasets depends on dataset specifics.

The implications extend beyond MT, suggesting potential applications in broader AI metric evaluations where score distributions introduce ties due to model characteristics or task nature. This development prompts further exploration into optimizing and generalizing tie thresholds across datasets and tasks.

Conclusion

In conclusion, this research advocates a new standard in metric evaluation that appropriately factors in ties. By addressing the mishandling of ties in Kendall's τ, the authors contribute a robust tool for evaluating AI systems where precise score differentiation is infeasible or undesirable. This work not only refines MT metric evaluation but also offers a framework applicable to other domains requiring nuanced metric assessments.

Figure 4: Score distributions where ties are introduced reveal biases toward tie predictions for perfect translations.

Figure 5: F1 scores for predicting ties or correct pair rankings for COMET-22 demonstrate higher efficacy in predicting ties.
