One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Published 12 Jun 2025 in cs.CL (arXiv:2506.10766v1)

Abstract: Pretraining massively multilingual LLMs for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

Summary

  • The paper demonstrates that a universal tokenizer, trained on 69 languages, enhances LLM adaptability, with win rate improvements of up to 19%.
  • It employs Byte Pair Encoding with a language weighting strategy that balances the natural data distribution against grouping by script and family, achieving up to 8x faster adaptation while preserving nearly identical accuracy on primary languages.
  • The study highlights practical gains for multilingual LLMs, offering efficient language support even for resource-scarce and previously unseen languages.

Multilingual LLMs and Universal Tokenizers

Introduction

The paper "One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers" investigates how universal tokenizers can enhance the adaptability of large multilingual LLMs to new languages. By focusing on universally trained tokenizers, this study seeks to overcome the limitations of post-training language adaptation in LLMs related to language coverage and efficiency.

Background and Motivation

Training multilingual LLMs is challenging due to limited capacity, high computational costs, and scarcity of high-quality data for numerous languages. Previous work primarily addresses language adaptation post-pretraining through techniques like vocabulary extension or embedding retraining, which are resource-intensive. The lack of language coverage in tokenizers used during pretraining further complicates post-hoc language adaptation efforts.

Universal Tokenizer Approach

The proposed universal tokenizer strategy is central to this work. It involves training a tokenizer with data from a broader range of languages than those primarily used during pretraining. This approach aims to enhance the model’s language plasticity, allowing faster and more efficient adaptation to new languages. The core hypothesis is that a universal tokenizer can improve adaptability without degrading performance on the primary pretraining languages.

Figure 1: The universal tokenizer exhibits 2x higher plasticity and adapts up to 8x faster than the cluster-specific baseline tokenizer.

Methodology

Language Coverage and Model Training

The experiments covered 69 languages, grouped into three geographic clusters: European, Asian, and Middle-Eastern & Indic languages. Models were pretrained on primary languages from each cluster, with systematic adaptation experiments conducted on an expanded set of languages. The adaptation experiments included strategies like continued pretraining and targeted adaptation, both for languages seen in tokenizer training and those completely unseen.

Tokenizer Specifications

The universal tokenizer was trained using Byte Pair Encoding across all languages, with a vocabulary size of 250k tokens. A unique feature of the tokenizer was its language weighting strategy, balancing the natural data distribution with language grouping by script and family.
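The paper's exact weighting scheme balances the natural data distribution with grouping by script and family; one common way such balancing is realized, sketched here purely as an illustrative assumption (the temperature value and document counts below are hypothetical, not the paper's), is temperature-based sampling:

```python
# Temperature-based language weighting for tokenizer training data.
# p_i is proportional to n_i^(1/T): T=1 keeps the natural distribution,
# while larger T flattens it, upsampling low-resource languages.
# The counts and temperature here are illustrative, not from the paper.

def sampling_weights(doc_counts, temperature=3.0):
    """Map per-language document counts to sampling probabilities."""
    scaled = {lang: n ** (1.0 / temperature) for lang, n in doc_counts.items()}
    total = sum(scaled.values())
    return {lang: w / total for lang, w in scaled.items()}

counts = {"en": 1_000_000, "de": 200_000, "sw": 5_000}
weights = sampling_weights(counts)
```

Under this scheme the relative ordering of languages is preserved, but a low-resource language like Swahili receives a far larger share of the tokenizer's training data than its natural proportion.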

Figure 2: Win rates for models trained with the Universal and Cluster tokenizers against Dolly generations.

Results

Pretraining Performance

The universal tokenizer performed competitively with cluster-specific tokenizers on primary languages, with average accuracy differences within 0.5%. It also showed superior plasticity on expanded languages, with up to a 19% increase in win rates during continued pretraining and a 14.6% improvement in targeted adaptation.
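Win rate, the headline metric above, is the fraction of pairwise comparisons one model's generations win against another's. A minimal sketch of the computation follows; the `judge` function is a hypothetical stand-in for the paper's actual judging setup (here it simply prefers the longer answer, for demonstration only):

```python
# Pairwise win rate between two models' generations.
# `judge` is a hypothetical placeholder, not the paper's judge.

def judge(answer_a, answer_b):
    """Return 'a', 'b', or 'tie' for one prompt's pair of answers."""
    if len(answer_a) > len(answer_b):
        return "a"
    if len(answer_b) > len(answer_a):
        return "b"
    return "tie"

def win_rate(answers_a, answers_b):
    """Percentage of non-tied comparisons won by model A."""
    wins = ties = 0
    for a, b in zip(answers_a, answers_b):
        verdict = judge(a, b)
        if verdict == "a":
            wins += 1
        elif verdict == "tie":
            ties += 1
    decided = len(answers_a) - ties
    return 100.0 * wins / decided if decided else 0.0

rate = win_rate(["long answer", "short"], ["short", "tiny"])
```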

Figure 3: The universal tokenizer performs nearly identically to the Cluster tokenizer throughout training.

Adaptation Efficiency

The universal tokenizer enabled up to 8x faster adaptation for expanded languages, even when minimal additional data from new languages was included during pretraining. This efficiency significantly benefits practitioners aiming to expand language coverage with limited computational resources.

Discussion

The findings suggest that using a universal tokenizer from the outset can effectively boost the adaptability of LLMs to new languages. This method contrasts with the typical post-training techniques and showcases notable improvements in adaptation speed and efficiency. The universal tokenizer also shows promise for languages previously unseen in both tokenizer and pretraining stages, leading to gains of up to 5% in win rates.

Figure 4: Adaptation results per language in the Euro cluster, together with the tokenizers' compression ratios.
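The compression ratio referenced in Figure 4 measures how many raw characters a tokenizer packs into each token on average; higher values indicate more efficient coverage of a language. A minimal sketch of the metric is below, using a hypothetical byte-level tokenizer as a worst-case stand-in (the paper's tokenizer would yield much higher ratios):

```python
# Tokenizer compression ratio: average characters represented per token.
# The tokenize function here is a hypothetical byte-level stand-in
# (the worst-case baseline), not the paper's 250k-vocab BPE tokenizer.

def tokenize(text):
    """Stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def compression_ratio(texts, tokenizer=tokenize):
    """Characters per token, averaged over a corpus."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tokenizer(t)) for t in texts)
    return chars / tokens

ratio = compression_ratio(["hello world", "habari ya dunia"])
```

Note that for non-ASCII scripts a byte-level baseline drops below 1.0 character per token, which is one reason tokenizer coverage matters so much for languages outside a tokenizer's training set.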

Conclusion

The study concludes that universal tokenizers significantly improve multilingual plasticity in LLMs and facilitate more efficient language adaptation strategies. This capability has important implications for expanding language support in AI systems globally, enabling more equitable technology access across diverse linguistic settings. Future research might explore integrating universal tokenizers with other language adaptation techniques to further optimize performance across an even wider range of languages.
