A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

Published 11 Jan 2024 in cs.CL and cs.AI (arXiv:2401.05749v2)

Abstract: We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual LLMs on both monolingual and bilingual data scraped from the web.


Summary

  • The paper demonstrates that a significant share of web content, especially in lower-resource languages, consists of low-quality machine translations, likely generated en masse for ad revenue.
  • It constructs and analyzes a 6.4 billion sentence multi-way parallel corpus, revealing that translation quality decreases as more languages are added.
  • The study highlights critical implications for multilingual LLM training, emphasizing data quality challenges and the need to detect low-quality MT outputs.

Understanding Web Content Origins

In "A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism," the researchers examine the widespread use of Machine Translation (MT) across the internet, focusing on how content is translated into many languages at once. A key finding is that a significant portion of web content, particularly in lower-resource languages, appears to be low-quality machine translation. Concerningly, much of this content derives from poorly written English sources, likely translated en masse for the sole purpose of generating online ad revenue.

Analysis of Multi-Way Translations

The paper grounds its claims in an extensive analysis of a multi-way parallel corpus the team assembled, comprising 6.4 billion unique sentences spanning 90 languages. The corpus, named Multi-Way ccMatrix (MWccMatrix), is notable for its vast scale and is made publicly available alongside tools for reproducing it. The analysis shows that as a piece of content is translated into more languages, the quality of those translations decreases, suggesting a heavier reliance on MT.
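The core corpus-construction idea, combining bilingual sentence pairs that share a sentence into multi-way parallel tuples, can be sketched with a toy union-find pass. This is a minimal illustration with made-up identifiers, not the authors' released tooling:

```python
from collections import defaultdict

def merge_into_multiway(bitext_pairs):
    """Group bilingual sentence pairs that share a sentence into
    multi-way parallel tuples, using a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for src, tgt in bitext_pairs:
        union(src, tgt)

    groups = defaultdict(set)
    for sent in parent:
        groups[find(sent)].add(sent)
    return [sorted(g) for g in groups.values()]

# Toy bitext pairs, written as "language:sentence" identifiers.
pairs = [
    ("en:hello", "fr:bonjour"),
    ("en:hello", "de:hallo"),
    ("fr:bonjour", "es:hola"),   # transitively joins the same tuple
    ("en:bye", "fr:au revoir"),  # a separate 2-way pair
]
tuples = merge_into_multiway(pairs)
# The first tuple spans four languages, i.e. its parallelism is 4.
```

The number of languages in a tuple is the "multi-way parallelism" the paper correlates with translation quality.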

Quality and Topic Distribution Concerns

Translation quality was evaluated using Quality Estimation (QE) metrics, showing that multi-way parallel translations are significantly lower in quality than bilingual (2-way) translations. Moreover, the research uncovered that multi-way parallel content differs in topic distribution from its bilingual counterpart: topics categorized as "Conversation & Opinion" surge, featuring shorter and more predictable sentences often linked to low-quality content.
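The quality trend described above can be illustrated by bucketing QE scores by parallelism count. The scores below are made-up toy numbers, not outputs of any particular QE model:

```python
from collections import defaultdict
from statistics import mean

def quality_by_parallelism(records):
    """Average quality-estimation score per multi-way parallelism count.

    `records` are (num_languages, qe_score) pairs; a downward trend in
    the averages would mirror the paper's finding that more widely
    translated content tends to be lower quality."""
    buckets = defaultdict(list)
    for n_langs, score in records:
        buckets[n_langs].append(score)
    return {n: round(mean(scores), 3) for n, scores in sorted(buckets.items())}

toy = [(2, 0.82), (2, 0.79), (3, 0.74), (5, 0.61), (5, 0.58), (8, 0.49)]
averages = quality_by_parallelism(toy)
```

With the toy numbers, average quality drops monotonically as parallelism rises, the shape of the trend reported in the paper.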

Implications for Multilingual Model Training

The study underlines the critical challenges of training multilingual LLMs on web-scraped data, which often incorporates this machine-translated content. The authors warn that such data can degrade the fluency and correctness of LLMs. Additionally, they propose multi-way parallelism as a useful signal for detecting low-quality, MT-generated content, a finding that could help refine data sourcing strategies for better model training outcomes.
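As a sketch of how multi-way parallelism could serve as a data-filtering signal, the hypothetical filter below drops sentences observed in more than a threshold number of languages. Both the threshold and the data are illustrative; the paper does not prescribe a specific cutoff:

```python
def filter_training_data(sentences, max_parallelism=4):
    """Keep only sentences whose multi-way parallelism is at or below a
    threshold, treating high parallelism as a proxy for likely MT output.

    `sentences` is a list of (text, num_languages_observed) pairs."""
    return [text for text, n_langs in sentences if n_langs <= max_parallelism]

corpus = [
    ("a sentence seen in two languages", 2),
    ("likely MT boilerplate seen in nine languages", 9),
    ("a sentence seen in three languages", 3),
]
kept = filter_training_data(corpus)
```

In practice such a rule would be one signal among several (alongside QE scores and standard corpus-filtering heuristics), not a standalone filter.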

In conclusion, the paper underscores the need to raise data quality standards for LLM training and points to opportunities for improvement through more sophisticated analysis of web content.
