A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

Published 11 Jan 2024 in cs.CL and cs.AI (arXiv:2401.05749v2)

Abstract: We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual LLMs on both monolingual and bilingual data scraped from the web.


Summary

  • The paper demonstrates that a significant share of web content, especially in lower-resource languages, consists of low-quality machine translations, likely generated en masse for ad revenue.
  • It constructs and analyzes a 6.4 billion sentence multi-way parallel corpus, revealing that translation quality decreases as more languages are added.
  • The study highlights critical implications for multilingual LLM training, emphasizing data quality challenges and the need to detect low-quality MT outputs.

Understanding Web Content Origins

In "A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism," the researchers examine the widespread use of Machine Translation (MT) across the internet, focusing on how content is translated into many languages at once. A key finding is that a significant portion of web content, particularly in lower-resource languages, appears to be low-quality machine translation. Concerningly, much of this content derives from poorly written English sources, likely translated en masse for the sole purpose of generating online ad revenue.

Analysis of Multi-Way Translations

The paper grounds its claims in an extensive analysis of a multi-way parallel corpus the team assembled, comprising 6.4 billion unique sentences spanning 90 languages. The corpus, named Multi-Way ccMatrix (MWccMatrix), is notable for its vast scale and is made publicly available alongside tools for reproducing it. The analysis shows that as a piece of content is translated into more languages, the quality of those translations decreases, suggesting a heavier reliance on MT.
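The core corpus-construction idea, combining bilingual sentence pairs that share a sentence into multi-way parallel tuples, can be sketched with a toy union-find pass. This is a minimal illustration with made-up identifiers, not the authors' released tooling:

```python
from collections import defaultdict

def merge_into_multiway(bitext_pairs):
    """Group bilingual sentence pairs that share a sentence into
    multi-way parallel tuples, using a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for src, tgt in bitext_pairs:
        union(src, tgt)

    groups = defaultdict(set)
    for sent in parent:
        groups[find(sent)].add(sent)
    return [sorted(g) for g in groups.values()]

# Toy bitext pairs, written as "language:sentence" identifiers.
pairs = [
    ("en:hello", "fr:bonjour"),
    ("en:hello", "de:hallo"),
    ("fr:bonjour", "es:hola"),   # transitively joins the same tuple
    ("en:bye", "fr:au revoir"),  # a separate 2-way pair
]
tuples = merge_into_multiway(pairs)
# The first tuple spans four languages, i.e. its parallelism is 4.
```

The number of languages in a tuple is the "multi-way parallelism" the paper correlates with translation quality.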

Quality and Topic Distribution Concerns

Translation quality was evaluated using Quality Estimation (QE) metrics, showing that multi-way parallel translations are significantly lower in quality than bilingual (2-way) translations. Moreover, the research uncovered that multi-way parallel content differs in topic distribution from its bilingual counterpart: topics categorized as "Conversation & Opinion" surge, featuring shorter and more predictable sentences often linked to low-quality content.
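The quality trend described above can be illustrated by bucketing QE scores by parallelism count. The scores below are made-up toy numbers, not outputs of any particular QE model:

```python
from collections import defaultdict
from statistics import mean

def quality_by_parallelism(records):
    """Average quality-estimation score per multi-way parallelism count.

    `records` are (num_languages, qe_score) pairs; a downward trend in
    the averages would mirror the paper's finding that more widely
    translated content tends to be lower quality."""
    buckets = defaultdict(list)
    for n_langs, score in records:
        buckets[n_langs].append(score)
    return {n: round(mean(scores), 3) for n, scores in sorted(buckets.items())}

toy = [(2, 0.82), (2, 0.79), (3, 0.74), (5, 0.61), (5, 0.58), (8, 0.49)]
averages = quality_by_parallelism(toy)
```

With the toy numbers, average quality drops monotonically as parallelism rises, the shape of the trend reported in the paper.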

Implications for Multilingual Model Training

The study underlines the critical challenges of training multilingual LLMs on web-scraped data, which often incorporates this machine-translated content. The authors warn that such data can degrade the fluency and correctness of LLMs. Additionally, they propose multi-way parallelism as a useful signal for detecting low-quality, MT-generated content, a finding that could help refine data sourcing strategies for better model training outcomes.
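As a sketch of how multi-way parallelism could serve as a data-filtering signal, the hypothetical filter below drops sentences observed in more than a threshold number of languages. Both the threshold and the data are illustrative; the paper does not prescribe a specific cutoff:

```python
def filter_training_data(sentences, max_parallelism=4):
    """Keep only sentences whose multi-way parallelism is at or below a
    threshold, treating high parallelism as a proxy for likely MT output.

    `sentences` is a list of (text, num_languages_observed) pairs."""
    return [text for text, n_langs in sentences if n_langs <= max_parallelism]

corpus = [
    ("a sentence seen in two languages", 2),
    ("likely MT boilerplate seen in nine languages", 9),
    ("a sentence seen in three languages", 3),
]
kept = filter_training_data(corpus)
```

In practice such a rule would be one signal among several (alongside QE scores and standard corpus-filtering heuristics), not a standalone filter.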

In conclusion, the paper underscores the need to raise data quality standards for LLM training and points to opportunities for improvement through more sophisticated analysis of web content.
