- The paper’s main contribution is a large-scale comparison of MQM expert evaluations and crowd-sourced assessments, revealing significant discrepancies in MT system rankings.
- It shows that MQM, which leverages professional translators and full-document context, uncovers frequent major accuracy errors in machine translation outputs.
- The study also finds that advanced automatic metrics based on pre-trained embeddings better align with expert assessments, suggesting more reliable evaluation alternatives.
The paper "Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation" conducts an exhaustive exploration of human evaluation techniques applied to machine translation (MT) systems, focusing on the discrepancies in system rankings derived from different evaluation practices. The central thesis posits that current human evaluation methods—particularly those that employ untrained crowd workers—might yield unreliable assessments, potentially leading to erroneous conclusions about MT quality, including claims of human parity.
Methodology and Results
The study employs the Multidimensional Quality Metrics (MQM) framework as a rigorous basis for evaluation, applied to an extensive dataset from the WMT 2020 shared task covering the English→German and Chinese→English language pairs. Unlike crowd-sourced scoring, the MQM protocol relies on professional translators working with full document context, grounding evaluations in detailed error annotation. The research highlights several key findings:
- MQM versus Crowd-Sourced Evaluations: System rankings under MQM diverge significantly from those produced by WMT crowd workers. Notably, human translations are rated higher than machine outputs under MQM, suggesting that earlier evaluations claiming human parity may be premature or incorrect.
- Performance of Automatic Metrics: The paper observes that some automatic evaluation metrics, particularly those based on pre-trained embeddings, outperform crowd worker evaluations in aligning with MQM rankings. This implies that more sophisticated automatic approaches could serve as a more reliable alternative to untrained human evaluations.
- Error Distribution and Analysis: Through MQM's fine-grained error annotation, the study shows that major accuracy errors predominate in MT output relative to human translations, pinpointing where MT systems most need improvement and suggesting targets for future research (see the scoring sketch after this list).
- Implications for Future Evaluations: The study provides recommendations on the number of MQM ratings needed to achieve reliable system rankings. It concludes that MQM should be preferred, particularly as MT quality improves and increasingly fine-grained distinctions between systems must be assessed accurately.
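To make the MQM protocol concrete, the following is a minimal Python sketch of severity-weighted scoring. The weights follow the scheme reported in the paper (25 for non-translation errors, 5 for other major errors, 1 for minor errors, 0.1 for minor fluency/punctuation), but the data structures, field names, and example annotations are illustrative assumptions, not the authors' released code.

```python
from dataclasses import dataclass


@dataclass
class ErrorAnnotation:
    """One annotated error span (hypothetical structure)."""
    segment_id: int
    category: str   # e.g. "accuracy/mistranslation"
    severity: str   # "major" or "minor"


def error_weight(category: str, severity: str) -> float:
    """Severity weight per the paper's scheme; exact field values are assumed."""
    category, severity = category.lower(), severity.lower()
    if severity == "major":
        return 25.0 if "non-translation" in category else 5.0
    if severity == "minor":
        return 0.1 if category.startswith("fluency/punctuation") else 1.0
    return 0.0  # neutral or unmarked spans contribute nothing


def system_score(annotations: list[ErrorAnnotation], num_segments: int) -> float:
    """System-level MQM penalty: average per-segment error weight (lower is better)."""
    scores = [0.0] * num_segments
    for ann in annotations:
        scores[ann.segment_id] += error_weight(ann.category, ann.severity)
    return sum(scores) / num_segments


# Hypothetical usage with three segments and three annotated errors.
anns = [
    ErrorAnnotation(0, "accuracy/mistranslation", "major"),
    ErrorAnnotation(0, "fluency/punctuation", "minor"),
    ErrorAnnotation(2, "accuracy/omission", "minor"),
]
print(system_score(anns, num_segments=3))  # (5 + 0.1 + 1) / 3 ≈ 2.03
```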
Implications and Future Directions
The implications of this study are manifold. Practically, it suggests that MT evaluations in large-scale tasks should increasingly rely on frameworks like MQM, which involve expert annotators and emphasize document-level context. Theoretically, it underscores the need to refine the error taxonomies used in MT evaluation, suggesting that research should focus not only on reducing major accuracy errors but also on the subtler dimensions of translation quality that professional human translators can detect.
Looking ahead, researchers are encouraged to leverage the publicly released corpus from this study to develop even more advanced automatic metrics that may eventually close the gap between automatic and expert human assessment. The study also implies that as MT approaches human-level translation quality, evaluation methodologies must be refined in step to ensure nuanced, contextually informed assessments.
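As a sketch of how expert MQM scores could be used to meta-evaluate a candidate automatic metric, the snippet below computes system-level Kendall and Pearson correlations between metric scores and MQM penalties. The system names and all numbers are invented for illustration, and SciPy is assumed to be available; this is not the paper's evaluation code.

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical system-level scores: MQM penalties (lower = better) and a
# candidate automatic metric (higher = better). All values are made up.
mqm_scores = {"SystemA": 1.8, "SystemB": 2.4, "SystemC": 3.1, "Human-P": 0.9}
metric_scores = {"SystemA": 0.74, "SystemB": 0.69, "SystemC": 0.61, "Human-P": 0.80}

systems = sorted(mqm_scores)
# Negate MQM penalties so both series are oriented "higher = better".
mqm = [-mqm_scores[s] for s in systems]
metric = [metric_scores[s] for s in systems]

tau, _ = kendalltau(mqm, metric)  # rank agreement across systems
r, _ = pearsonr(mqm, metric)      # linear agreement across systems
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```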
Conclusion
In sum, the paper provides a thorough and empirically grounded critique of traditional human evaluation methods for MT. By advocating for the MQM framework and revealing the limitations of crowd-sourced evaluations, the authors contribute significantly to the discourse on improving evaluation standards, thus facilitating more accurate assessments of MT progress. This work is pivotal for guiding future research in machine translation evaluation, urging the community to adopt and integrate more reliable and context-aware evaluation practices.