Revisiting non-English Text Simplification: A Unified Multilingual Benchmark

Published 25 May 2023 in cs.CL and cs.AI | (2305.15678v1)

Abstract: Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs. This benchmark will encourage research in developing more effective multilingual text simplification models and evaluation metrics. Our experiments using MultiSim with pre-trained multilingual LLMs reveal exciting performance improvements from multilingual training in non-English settings. We observe strong performance from Russian in zero-shot cross-lingual transfer to low-resource languages. We further show that few-shot prompting with BLOOM-176b achieves comparable quality to reference simplifications outperforming fine-tuned models in most languages. We validate these findings through human evaluation.

Citations (21)

Summary

The paper introduces MULTISIM, a unified benchmark featuring 27 datasets and over 1.7 million sentence pairs for non-English text simplification.
It demonstrates that multilingual training and few-shot prompting with models like mT5 and BLOOM significantly boost performance in low-resource languages.
Experiments show that semantic similarity-based example selection outperforms random sampling, achieving near-human simplification quality across diverse languages.

Introduction

The paper "Revisiting non-English Text Simplification: A Unified Multilingual Benchmark" addresses a gap in automatic text simplification (ATS) research by introducing the MULTISIM benchmark. The benchmark comprises 27 datasets across 12 languages and is intended as a resource for improving multilingual ATS models and evaluation metrics. The paper’s central focus is text simplification for non-English languages, which have historically been under-represented compared to English.

Figure 1: Papers published each year with content related to text simplification and a specific language according to Google Scholar. The quantity of English text simplification work vastly exceeds all other languages.

The MULTISIM Benchmark

MULTISIM covers languages that vary widely in resource availability. It includes over 1.7 million complex-simple sentence pairs and is designed both to evaluate multilingual simplification models and to support their development. The paper reports that multilingual pre-trained models such as mT5 and BLOOM achieve effective cross-lingual transfer and few-shot performance, even for low-resource languages.

Figure 2: Data availability for text simplification in all languages partitioned on collection strategy. Despite only including three of the most common English datasets, English resources outnumber all other language resources combined.

Experimental Insights

Experiments on the MULTISIM benchmark show that multilingual training improves performance on non-English text simplification. Zero-shot cross-lingual transfer from Russian datasets to low-resource languages is particularly strong, pointing to the importance of domain and script compatibility. Few-shot prompting with BLOOM-176b also outperforms fine-tuned models in most languages, including low-resource ones.

Figure 3: Semantic similarity few-shot performance in low-resource languages. Few-shot prompting achieves higher SARI than fine-tuned mT5.

Comparative Performance

In few-shot settings, selecting prompt examples by semantic similarity to the input consistently outperforms random sampling. The advantage holds across the four datasets compared in Figure 4, which suggests the method is robust to differences in language and domain.

Figure 4: Semantic similarity vs random sampling few-shot performance on four diverse datasets. Semantic similarity consistently scores above random sampling.
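
The idea behind semantic similarity-based example selection can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the summary does not say which sentence encoder or prompt template was used, so the character n-gram vectors, the function names (`char_ngrams`, `select_examples`, `build_prompt`), and the "Complex:/Simple:" template are all assumptions made for the example. A real system would presumably swap in a multilingual sentence encoder.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (a toy stand-in for a sentence encoder)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=2):
    """Return the k (complex, simple) pairs whose complex side is most similar to the query."""
    query_vec = char_ngrams(query)
    ranked = sorted(
        pool,
        key=lambda pair: cosine(query_vec, char_ngrams(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, examples):
    """Format the selected pairs as demonstrations, ending with the query to simplify.

    The template is hypothetical; it is not taken from the paper.
    """
    blocks = [f"Complex: {c}\nSimple: {s}" for c, s in examples]
    blocks.append(f"Complex: {query}\nSimple:")
    return "\n\n".join(blocks)
```

Random sampling would replace the `sorted(...)` call with `random.sample(pool, k)`; the comparison in Figure 4 is between those two ways of filling the demonstration slots.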

Human Evaluation

Human evaluation aligns with automatic metrics, with models trained on large-scale data outperforming others. Few-shot prompting shows near-human simplification levels in several languages, reinforcing its applicability across varying linguistic contexts.

Conclusion

The paper makes a substantive contribution by releasing MULTISIM, advancing the field of non-English ATS. This benchmark facilitates comprehensive multilingual model evaluation and encourages future research in enhancing simplification across diverse languages. The insights into few-shot and zero-shot learning expand possibilities for deploying text simplification technologies in languages with limited resources.

Implications and Future Directions

MULTISIM’s introduction has implications for advancing ATS models capable of handling linguistic diversity. Future work could explore realigning automated corpora for improved accuracy and extending human evaluations to ensure simplifications meet users’ comprehension needs across different demographics.

Figure 5: Distribution of document-level compression ratio for document-aligned corpora, smoothed by Gaussian kernel density estimation. Means are marked by dashed lines.
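
For readers unfamiliar with the quantities in Figure 5, the following Python sketch shows one way to compute a document-level compression ratio and smooth a set of ratios with a Gaussian kernel density estimate. The caption does not specify the unit of length or the bandwidth, so measuring length in characters, the `bandwidth` parameter, and the function names are assumptions made for illustration.

```python
import math


def compression_ratio(complex_doc, simple_doc):
    """Length of the simplified document divided by the original, in characters (assumed unit)."""
    if not complex_doc:
        raise ValueError("complex document must be non-empty")
    return len(simple_doc) / len(complex_doc)


def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps centred on each sample."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

A ratio below 1 means the simplified document is shorter than the original. Evaluating `density` on a grid of ratio values gives a smoothed curve of the kind the caption describes, and the mean of the samples gives the position of the dashed line.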

In summary, the paper lays a foundation for future work on making complex information accessible to readers across many languages.
