Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Published 20 May 2023 in cs.CL (arXiv:2305.12182v2)

Abstract: The NLP community has mainly focused on scaling LLMs vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, "help" from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

Citations (80)

Summary

  • The paper introduces Glot500-m, which scales multilingual language models to 511 languages by focusing on horizontal data expansion.
  • It develops a novel corpus, Glot500-c, that aggregates 700GB of multilingual data from 150 sources with rigorous cleaning and filtering.
  • Evaluations demonstrate significant performance gains for low-resource languages, bridging the digital divide in language technology.

Scaling Multilingual LLMs with Glot500: An Expansion to 511 Languages

The paper "Glot500: Scaling Multilingual Corpora and LLMs to 500 Languages" presents an approach to broadening the scope of multilingual LLMs by developing Glot500-m, an LLM that covers 511 languages, most of them low-resource or underrepresented. This marks a departure from the conventional trajectory of vertical scaling, which concentrates model capacity and resource allocation on a limited set of high-resource languages. Instead, Glot500 emphasizes horizontal scaling, addressing the pressing need to extend NLP capabilities to a much wider array of the world's languages.

Methodology and Dataset Collection

The creation of Glot500 includes the development of Glot500-c, a multilingual corpus tailored to support the training of an LLM like Glot500-m. The corpus spans 511 languages and draws from approximately 150 data sources, aggregating around 700GB of multilingual data. It combines high-quality sources, such as linguist-verified translations, with less curated data from web crawls. Importantly, the dataset underwent a rigorous cleaning process, with both sentence-level and corpus-level filters applied to minimize noise and ensure the integrity of the data used for training.
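
The paper does not reproduce its filter implementations here, but a sentence-level filter of the kind described can be sketched as follows. The specific heuristics (minimum word count, alphabetic-character ratio, markup detection) are illustrative assumptions, not the paper's actual rules:

```python
import re

def sentence_filter(sentence: str) -> bool:
    """Hypothetical sentence-level filter: drop very short lines,
    lines dominated by digits/punctuation, and leftover markup."""
    if len(sentence.split()) < 3:
        return False                        # too short to be a useful sentence
    alpha = sum(ch.isalpha() for ch in sentence)
    if alpha / max(len(sentence), 1) < 0.5:
        return False                        # mostly digits or punctuation
    if re.search(r"<[^>]+>", sentence):
        return False                        # residual HTML markup
    return True

raw = [
    "This is a clean example sentence.",
    "123 456 789",
    "<div>boilerplate</div>",
    "ok",
]
clean = [s for s in raw if sentence_filter(s)]
print(clean)  # only the first sentence survives
```

Corpus-level filters would operate analogously, but on aggregate statistics of a whole source rather than individual sentences.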

Glot500-c is organized by language-script pair: the same language written in different scripts is treated as a separate unit, and data from varied corpora is combined to balance the distribution between high-resource (head) and low-resource (tail) languages. A language-script pair is included in the training dataset Glot500-c only if it exceeds a minimum threshold of 30,000 sentences.
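
The 30,000-sentence inclusion rule can be illustrated with a small sketch; the language-script codes and counts below are invented for illustration, and only the threshold itself comes from the paper:

```python
# Hypothetical sentence counts per language-script pair.
sentence_counts = {
    "eng_Latn": 5_000_000,  # head language
    "quc_Latn": 120_000,    # tail language, above threshold
    "xyz_Latn": 12_000,     # below threshold -> excluded
}

MIN_SENTENCES = 30_000  # inclusion threshold from the paper

included = {ls for ls, n in sentence_counts.items() if n >= MIN_SENTENCES}
print(sorted(included))
```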

Model Training and Evaluation

Glot500-m is an augmented adaptation of XLM-R (base variant) with a substantially extended vocabulary: roughly 151,000 new subword tokens are added, prioritizing representation for languages previously unsupported by comparable models. The model then undergoes continued pretraining on Glot500-c, so that the new token embeddings and the existing parameters adapt to the scale and diversity of the corpus.
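
Vocabulary extension of this kind can be sketched in miniature. The toy vocabulary, embedding dimension, and mean-initialization heuristic below are assumptions for illustration; only the idea of appending new token embeddings before continued pretraining reflects the paper:

```python
import random

random.seed(0)
DIM = 4  # toy embedding dimension (XLM-R base actually uses 768)

# Toy pretrained vocabulary and embedding table standing in for XLM-R's.
vocab = ["<s>", "</s>", "the", "of"]
embeddings = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in vocab]

# Hypothetical new subword tokens for previously uncovered scripts.
new_tokens = ["ᓄᓇ", "ꕉꕜ"]

# One common initialization heuristic (an assumption here, not necessarily
# the paper's): start each new row at the mean of the existing embeddings,
# then let continued pretraining specialize it.
mean_vec = [sum(col) / len(embeddings) for col in zip(*embeddings)]
for tok in new_tokens:
    vocab.append(tok)
    embeddings.append(list(mean_vec))

print(len(vocab), len(embeddings))  # 6 6
```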

Evaluation of Glot500-m reveals significant improvements over the baselines, XLM-R base (XLM-R-B) and XLM-R large (XLM-R-L). Across the evaluation tasks, including pseudoperplexity and roundtrip alignment, the gains are especially pronounced for tail languages, underscoring Glot500-m's potential for previously underserved linguistic communities. A pivotal finding is that Glot500-m not only matches but often exceeds baseline performance on head languages despite the expanded language coverage, suggesting synergistic benefits from the enriched multilingual context.
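
Pseudoperplexity scores a sentence under a masked LM by masking each token in turn, scoring it given the rest of the sentence, and exponentiating the mean negative log-probability. A minimal sketch, with invented per-token log-probabilities standing in for a real model's outputs:

```python
import math

def pseudo_perplexity(token_logprobs):
    """Pseudo-perplexity: exp of the mean negative log-probability
    assigned to each token when it is masked out in turn."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical per-token log-probabilities from a masked LM.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(round(pseudo_perplexity(logprobs), 4))
```

Lower values indicate the model finds the sentence more predictable, so gains on tail languages show up directly as reduced pseudoperplexity.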

Implications and Future Directions

The implications of this research are expansive, particularly in promoting language technology equity. By considerably increasing the number of languages and scripts supported by LLMs, Glot500-m acts as a bridge over the digital divide that linguistically marginalized communities face in accessing language technology.

Future developments could include exploring the impact of model size and applying knowledge distillation to compress massively multilingual models into more compact variants, facilitating deployment in resource-restricted settings. Furthermore, advancing methods for integrating parallel corpora to bolster machine translation for low-resource languages represents another promising avenue.

Glot500-m establishes itself as a critical milestone in the ongoing effort to democratize NLP resources globally, ensuring that a more inclusive range of languages benefit from AI advancements. This work enables the NLP community to take profound steps toward supporting linguistic diversity, thus aligning technological interventions with a more globally inclusive agenda.
