To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation

Published 2 Jan 2024 in cs.CL | (2401.01419v1)

Abstract: We conduct a large-scale, fine-grained comparative analysis of machine translations (MT) against human translations (HT) through the lens of morphosyntactic divergence. Across three language pairs and two types of divergence, defined as structural differences between the source and the target, MT is consistently more conservative than HT, with less morphosyntactic diversity, more convergent patterns, and more one-to-one alignments. By analyzing different decoding algorithms, we attribute this discrepancy to the use of beam search, which biases MT towards more convergent patterns. This bias is most amplified when the convergent pattern appears around 50% of the time in the training data. Lastly, we show that for a majority of morphosyntactic divergences, their presence in HT is correlated with decreased MT performance, indicating that they present a greater challenge for MT systems.
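
The claim that a mode-seeking decoder amplifies the more frequent pattern can be illustrated with a toy model. The sketch below is not the paper's experimental setup: it assumes a single binary choice between a convergent and a divergent pattern, and it replaces beam search with an idealized decoder that always emits the more probable option. The function names (`greedy_rate`, `sampled_rate`, `shift`) are hypothetical and exist only for this illustration.

```python
import random


def greedy_rate(p_convergent: float) -> float:
    """Rate of the convergent pattern under an idealized mode-seeking decoder.

    Assumption: the decoder always emits whichever of the two patterns
    is more probable. This is a stand-in for beam search, not beam search itself.
    """
    if p_convergent > 0.5:
        return 1.0
    if p_convergent < 0.5:
        return 0.0
    return 0.5  # exact tie: assume both patterns are equally likely


def sampled_rate(p_convergent: float, n: int = 10_000, seed: int = 0) -> float:
    """Rate of the convergent pattern when outputs are sampled from the model."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_convergent for _ in range(n))
    return hits / n


def shift(p_convergent: float) -> float:
    """Change in the convergent rate caused by mode-seeking decoding."""
    return greedy_rate(p_convergent) - p_convergent


# Print the training-data rate next to the sampled and mode-seeking rates.
for p in (0.1, 0.3, 0.45, 0.55, 0.7, 0.9):
    print(f"p={p:.2f}  sampled={sampled_rate(p):.3f}  "
          f"mode-seeking={greedy_rate(p):.1f}  shift={shift(p):+.2f}")
```

In this toy setting, sampling roughly reproduces the training-data rate, while the mode-seeking decoder pushes the rate to 0 or 1. The absolute shift is largest just either side of 0.5 and shrinks as the training-data rate approaches 0 or 1, which is one simple way to read the abstract's observation about patterns that appear around 50% of the time.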
