Interplay of Machine Translation, Diacritics, and Diacritization
Abstract: We investigate two research questions: (1) how machine translation (MT) and diacritization influence each other's performance in a multi-task learning setting, and (2) how keeping (vs. removing) diacritics affects MT performance. We examine both questions in high-resource (HR) and low-resource (LR) settings across 55 languages (36 African and 19 European). For (1), results show that diacritization significantly benefits MT in the LR scenario, doubling or even tripling performance for some languages, but harms MT in the HR scenario. Conversely, MT harms diacritization in the LR scenario but benefits it significantly in the HR scenario for some languages. For (2), MT performance is similar whether diacritics are kept or removed. In addition, we propose two classes of metrics to measure the complexity of a diacritical system and find that these metrics correlate positively with the performance of our diacritization models. Overall, our work provides insights for developing MT and diacritization systems under different data size conditions and may have implications that generalize beyond the 55 languages we investigate.
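As an illustration of what "removing diacritics" can mean in practice, the sketch below strips combining marks using Unicode normalization. It is a minimal example under stated assumptions, not the preprocessing used in the paper: the function names `strip_diacritics` and `make_diacritization_pair` are hypothetical, and the choice of NFD decomposition followed by dropping nonspacing marks (category `Mn`) is one common convention among several.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (assumed convention)."""
    # NFD splits a precomposed letter such as "è" into "e" + U+0300.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (general category "Mn"), keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


def make_diacritization_pair(text: str) -> tuple[str, str]:
    """Build a hypothetical (input, target) pair for a diacritization task."""
    return strip_diacritics(text), text


if __name__ == "__main__":
    print(strip_diacritics("chèvres"))
    print(make_diacritization_pair("Yorùbá"))
```

Letters that have no canonical decomposition, such as "ł" or "ø", pass through unchanged under this convention, so a real pipeline would need an explicit policy for them.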