- The paper introduces a novel LLM-generated parallel dataset using MBR decoding and QE reranking, challenging the reliance on human-generated data.
- The paper demonstrates that NMT models trained on NewsPaLM achieve higher performance than those trained on larger datasets like WMT'23.
- The paper shows that self-distillation enhances the generating LLM’s performance, offering a cost-effective approach to high-quality translations.
An Overview of the NewsPaLM MBR and QE Dataset for Neural Machine Translation
The paper "Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data" presents an in-depth evaluation of a new dataset designed to improve neural machine translation (NMT) systems. The dataset, NewsPaLM, is notable for being LLM-generated and for incorporating advanced decoding methods, namely Minimum Bayes Risk (MBR) decoding and Quality Estimation (QE) reranking, to produce high-quality parallel data.
Key Contributions
- Dataset Release: The paper releases the NewsPaLM dataset, which contains sentence-level and multi-sentence parallel data generated by an LLM. The dataset is unique in its use of MBR decoding and QE reranking, challenging the preconception that human-generated data is superior for machine translation tasks.
- Performance Evaluation: Extensive experiments show that NMT models trained on the NewsPaLM dataset, despite its significantly smaller size, outperform models trained on the full WMT'23 dataset, and even models trained on the highest-quality subset of WMT'23.
- Self-Distillation Effectiveness: The study explores self-distillation, in which the LLM used to generate the dataset is finetuned on its own outputs. This finetuning outperforms the LLM's native few-shot translation capability.
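The self-distillation step described above can be sketched as a simple loop: sample candidate translations, keep the best one according to a selection rule (e.g. MBR or QE), and finetune the model on the resulting pairs. This is a hypothetical illustration, not the paper's code; `generate`, `select`, and `finetune` are placeholder callables.

```python
def self_distill(model, sources, generate, select, finetune):
    """Finetune a model on its own best outputs (self-distillation sketch).

    generate(model, src) -> list of candidate translations (assumed)
    select(candidates)   -> best candidate, e.g. via MBR/QE (assumed)
    finetune(model, pairs) -> updated model (assumed)
    """
    pairs = []
    for src in sources:
        candidates = generate(model, src)      # sample N translations
        pairs.append((src, select(candidates)))  # keep the selected one
    return finetune(model, pairs)
```

The key design point is that no human references enter the loop: the training signal comes entirely from the model's own candidates plus an automatic selection criterion.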
Methodology
The NewsPaLM dataset creation involved:
- Source Data Collection: Drawing on the Newscrawl corpus to compile English-German and German-English source data.
- Decoding Techniques: Applying MBR decoding to sentence-level data and QE reranking to multi-sentence data, with BLEURT and MetricX as the respective utility metrics.
- Data Efficiency: Showing that simply filtering larger corpora for quality does not match the dataset's data efficiency; NewsPaLM achieves significant performance gains over the larger, conventionally used datasets.
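The two selection techniques named above can be illustrated with a minimal sketch. MBR decoding scores each candidate by its average utility against the other candidates (used as pseudo-references), while QE reranking scores each candidate against the source alone. Here `utility` (standing in for BLEURT) and `qe_score` (standing in for MetricX) are hypothetical callables; the real metrics are learned models.

```python
def mbr_decode(candidates, utility):
    """Pick the candidate with the highest expected utility,
    treating the other candidates as pseudo-references."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

def qe_rerank(source, candidates, qe_score):
    """Pick the candidate that a reference-free QE metric
    scores highest against the source sentence."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))
```

Note the cost asymmetry: MBR is quadratic in the number of candidates (every pair is scored), while QE reranking is linear, which is one reason reranking is attractive for longer, multi-sentence inputs.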
Implications and Future Work
Practically, the dataset's applications lie in refining NMT models, offering a path toward more efficient models that require less compute and less data while still delivering high-quality translations. This has immediate value in settings with tight computational budgets. Theoretically, NewsPaLM's success with MBR and QE methods opens avenues for further research on efficient dataset construction and its intersection with knowledge distillation.
Future work may extend to document-level translation using similar methodologies, iterative enhancement of dataset quality via LLM refinements, and further optimization of the distillation process. These directions could lead to even more reliable and context-appropriate translation systems, advancing the field of machine translation.
In summary, this work contributes a valuable dataset to the open-source community and provides compelling evidence that training on high-quality, machine-generated data can surpass training on large-scale human-generated datasets, challenging long-held assumptions about the latter's supremacy in NMT.