
When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

Published 24 Oct 2020 in cs.CL (arXiv:2010.12858v2)

Abstract: Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied to unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages, whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. Transliterating those languages significantly improves the ability of large-scale multilingual language models on downstream tasks.

Citations (160)

Summary

  • The paper presents a taxonomy that categorizes unseen languages into easy, intermediate, and hard groups based on script similarity and transfer performance.
  • It shows that unsupervised MLM tuning boosts results for intermediate languages like Maltese and Bambara.
  • The study reveals transliteration as a practical strategy to enhance cross-lingual transfer for hard languages with dissimilar scripts, such as Sorani Kurdish and Uyghur.

Overview of Handling New Languages with Multilingual Language Models

The paper "When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models" investigates the application of multilingual language models, such as mBERT and XLM-R, to languages that are markedly underrepresented in NLP resources, examining in particular the factors that govern successful transfer learning. The research contrasts the performance of monolingual and multilingual models on unseen languages, i.e., languages absent from the pretraining data.

The study examines the diverse behaviors of these models on unseen languages and organizes them into a taxonomy of three categories based on performance: Easy, Intermediate, and Hard. Each category reflects a distinct level of difficulty for a model attempting to achieve competitive results on downstream NLP tasks.

Easy, Intermediate, and Hard Languages

  1. Easy Languages: Languages that, despite being absent from the model's pretraining data, exhibit strong zero-shot performance. This occurs when they are close to seen languages in both language family and script. An example is Faroese, on which mBERT performs comparably to closely related high-resource languages.
  2. Intermediate Languages: Languages that require additional unsupervised MLM-tuning on available raw data to outperform baseline models. Maltese and Bambara are representative of this category: performance improves once the multilingual model is adapted to raw text in the target language.
  3. Hard Languages: Languages that pose significant challenges, chiefly because they are written in a script that differs from that of their relatives seen during pretraining. For example, Sorani Kurdish and Uyghur, both written in variants of the Arabic script, lag behind strong non-contextual baselines even after MLM-tuning.
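The MLM-tuning applied to intermediate (and attempted for hard) languages is simply continued masked-language-model pretraining on the target language's raw text. A minimal sketch of the BERT-style masking step it relies on (the function name is illustrative and the 80/10/10 split follows the original BERT recipe; this is not the paper's code):

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """BERT-style masking: ~15% of positions are selected for prediction.
    Of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 (ignored by the loss) at all unselected positions."""
    rng = random.Random(seed)
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok          # predict the original token here
            r = rng.random()
            if r < 0.8:
                input_ids[i] = mask_id            # replace with [MASK]
            elif r < 0.9:
                input_ids[i] = rng.randrange(vocab_size)  # random token
            # else: keep the original token unchanged
    return input_ids, labels
```

In practice this masking is applied on the fly each epoch (e.g., via a data collator), so the model sees different masks over the same small corpus.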

Importance of Script in Transfer Learning

The study highlights the critical role of script in the transfer learning capabilities of multilingual models. It argues that cross-lingual transfer is notably hindered when the unseen language is written in a script that differs from that of related languages in the pretraining set. The researchers demonstrate that transliterating languages like Sorani and Uyghur into the Latin script, aligning them with seen relatives such as Turkish, yields substantial performance gains on downstream tasks, underscoring the importance of script alignment.
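Character-level transliteration of this kind can be sketched in a few lines. The mapping below is a small illustrative subset of an Arabic-to-Latin table for Sorani, not the paper's actual transliteration scheme, which covers the full alphabet plus multi-character correspondences:

```python
# Toy subset of a Sorani (Arabic script) to Latin mapping -- illustration only.
SORANI_TO_LATIN = {
    "ب": "b", "د": "d", "ر": "r", "ز": "z",
    "س": "s", "ک": "k", "م": "m", "ن": "n",
    "و": "w", "ی": "y", "ا": "a", "ە": "e",
}

def transliterate(text, table=SORANI_TO_LATIN):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)
```

With such a function, the target-language raw data and task data are transliterated once, and the multilingual model is then tuned and evaluated on the Latin-script version.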

Practical Implications

These findings suggest that transliteration can serve as a pivotal step towards extending existing large-scale multilingual models to underrepresented languages. By bridging the script gap, transliteration allows the model to leverage representations learned from related languages.
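One intuition for why bridging the script gap helps: a subword vocabulary built mostly from Latin-script text can segment the transliterated form into known pieces, while the original script falls back to unknown tokens. A toy sketch with a hypothetical vocabulary and a simplified greedy WordPiece-style tokenizer (mBERT's real tokenizer uses "##" continuation pieces and a vocabulary of over 100k subwords):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece-style tokenization (simplified:
    no '##' continuation prefix). Returns ['[UNK]'] if any span is uncovered."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return ["[UNK]"]
    return pieces

# Hypothetical vocabulary biased toward Latin script, as mBERT's effectively is.
vocab = {"ber", "d", "b", "e", "r"}
print(wordpiece("berd", vocab))   # -> ['ber', 'd']   (known pieces)
print(wordpiece("بەرد", vocab))   # -> ['[UNK]']      (uncovered script)
```

An unseen script thus collapses many distinct words onto the same unknown token, starving the model of usable input; transliteration restores a meaningful segmentation.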

Speculations on Future Developments

A potential future direction is to better understand and exploit the interaction between script, language family, and model architecture. In particular, improved script-processing techniques, or model components explicitly designed to handle diverse scripts, could enhance the generalization of pretrained language models across typologically diverse languages.

In summary, this research provides valuable insights into the performance of multilingual models on unseen languages and proposes transliteration as a viable strategy to enhance cross-lingual transfer for hard-to-transfer languages. Such advancements could play a crucial role in democratizing access to NLP capabilities for a broader spectrum of the world's languages.
