A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
Abstract: We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine-generated content not only dominates the translations in lower-resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content that is translated into many languages, consistent with low-quality English content being translated en masse into many lower-resource languages via MT. Our work raises serious concerns about training models such as multilingual LLMs on both monolingual and bilingual data scraped from the web.