Parity-Aware BPE for Fair Multilingual Tokenization
- Parity-aware BPE (PaBPE) is a tokenization method that modifies classical BPE by focusing on the worst-compressed language to achieve equitable compression across multilingual corpora.
- It selects merges using language-specific bigram counts, reducing tokenization inefficiencies and disparities, especially for low-resource languages.
- Empirical results demonstrate that PaBPE maintains high overall compression and model performance while significantly lowering cross-lingual fairness gaps.
Parity-aware Byte-Pair Encoding (PaBPE) is a modification of the classical Byte-Pair Encoding (BPE) algorithm designed to improve cross-lingual fairness in tokenization, particularly for multilingual natural language processing. By iteratively maximizing the compression gain of the currently worst-compressed language during subword vocabulary construction, PaBPE mitigates the disparities in tokenization efficiency that arise when frequency-based strategies favor high-resource languages. The result is a more equitable distribution of tokenization costs and improved fairness metrics, with negligible impact on overall compression rates and downstream model performance (Foroutan et al., 6 Aug 2025).
1. Byte-Pair Encoding: Classical Foundations
Byte-Pair Encoding (BPE) is a greedy, frequency-driven algorithm for learning a fixed-size subword vocabulary that optimizes corpus-level compression. Given a corpus $\mathcal{D}$, an initial alphabet $\Sigma$ (e.g., bytes), and a target number of merges $M$, the algorithm starts with vocabulary $V_0 = \Sigma$ and maintains a tokenizer $T_t$ defined by the merge sequence $\mu_1, \dots, \mu_t$. The average compression rate for $\mathcal{D}$ under $T$ is:

$$c(T, \mathcal{D}) = \frac{\sum_{d \in \mathcal{D}} |T(d)|}{\sum_{d \in \mathcal{D}} |d|},$$

where $|d|$ is the length of $d$ in bytes, chars, or another unit (so a lower $c$ means better compression). Classical BPE greedily chooses at each step the pair with the largest bigram count in the tokenized corpus and merges it, maximizing immediate corpus-level compression gain:

$$\mu_{t+1} = \operatorname*{arg\,max}_{(a, b)} \; \mathrm{count}_{T_t(\mathcal{D})}(a, b).$$
The resulting subword vocabulary and merge history enable tokenization that efficiently compresses text but can induce significant inequity across languages in multilingual corpora.
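The greedy loop above can be sketched in a few lines of Python. This is an illustrative toy implementation, not the paper's code; character-level initialization and all function names are assumptions:

```python
from collections import Counter

def most_frequent_pair(token_seqs):
    """Return the adjacent token pair with the highest corpus-wide count."""
    counts = Counter()
    for seq in token_seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(pair[0] + pair[1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Greedy classical BPE: repeatedly merge the globally most frequent pair."""
    seqs = [list(doc) for doc in corpus]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        merges.append(pair)
        seqs = [apply_merge(s, pair) for s in seqs]
    return merges, seqs
```

Note that the pair statistics are aggregated over the whole corpus, which is exactly the step PaBPE changes.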
2. Motivation and Cross-Lingual Fairness Challenges
Typical frequency-based BPE objectives heavily privilege languages contributing the most tokens, disadvantaging low-resource languages in several ways:
- Increased tokenization length: Higher fertility (tokens per word) for low-resource languages.
- Morphological misalignment: Fragmented or implausible subwords in minority languages.
- Placeholder proliferation: Elevated rates of `<UNK>` assignment due to poor segmentation.
- Operational costs: Users of low-resource languages pay more per API token due to less efficient tokenization.
To address this, PaBPE introduces an explicit fairness objective, aiming for per-language compression rates to be as equal as possible, thereby lowering the computational "tax" on under-resourced languages.
3. Mathematical Formulation of Parity-aware Objectives
3.1. Global and Per-Language Compression Gain
The global compression gain of a merge $\mu$ on corpus $\mathcal{D}$ is:

$$\Delta(\mu; \mathcal{D}) = c(T_t, \mathcal{D}) - c(T_t \cup \{\mu\}, \mathcal{D}),$$

where $T_t \cup \{\mu\}$ is the tokenizer after applying merge $\mu$.
3.2. Identifying the Worst-Compressed Language
Partitioning the corpus by language into $\mathcal{D}_1, \dots, \mathcal{D}_L$, the worst-off (worst-compressed) language at step $t$ is:

$$\ell_t^* = \operatorname*{arg\,max}_{\ell \in \{1, \dots, L\}} \; c(T_t, \mathcal{D}_\ell).$$
3.3. Parity-aware Min–Max Objective
The ideal (but intractable) objective would be:

$$\min_{T} \; \max_{\ell \in \{1, \dots, L\}} \; c(T, \mathcal{D}_\ell),$$

minimized over tokenizers $T$ with at most $M$ merges. PaBPE greedily approximates this by, at each step, focusing solely on the worst-compressed language $\ell_t^*$. The selected merge is:

$$\mu_{t+1} = \operatorname*{arg\,max}_{(a, b)} \; \mathrm{count}_{T_t(\mathcal{D}_{\ell_t^*})}(a, b).$$
Only bigram statistics from $\mathcal{D}_{\ell_t^*}$ inform the merge, and the same merge is applied corpus-wide, thus incrementally improving fairness.
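A minimal sketch of the worst-language selection step, assuming the compression rate is measured as tokens per character (higher = worse); function and variable names are illustrative, not from the paper:

```python
def compression_rate(token_seqs, raw_texts):
    """Average tokens per character: higher means worse compression."""
    return sum(len(s) for s in token_seqs) / sum(len(t) for t in raw_texts)

def worst_language(seqs_by_lang, texts_by_lang):
    """Identify the language whose current compression rate is worst (largest)."""
    return max(seqs_by_lang,
               key=lambda l: compression_rate(seqs_by_lang[l], texts_by_lang[l]))
```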
4. Algorithmic Procedures and Pseudocode
The distinction between classical and Parity-aware BPE centers on bigram counting and selection scope:
Classical BPE
- Counts bigrams over the entire corpus and chooses the most frequent pair.
Parity-aware BPE
- For each merge:
- Computes per-language compression rates on dev data.
- Identifies $\ell_t^*$, the currently worst-compressed language.
- Counts bigrams only in $\mathcal{D}_{\ell_t^*}$.
- Selects the most frequent bigram in $\mathcal{D}_{\ell_t^*}$.
- Applies the chosen merge to all languages.
The pseudocode structure is as follows:
| Step | Classical BPE | Parity-aware BPE |
|---|---|---|
| Bigram counting | All of $\mathcal{D}$ | $\mathcal{D}_{\ell_t^*}$ only |
| Merge selection | Largest corpus-wide bigram count | Largest bigram count in $\mathcal{D}_{\ell_t^*}$ |
| Merge application | All of $\mathcal{D}$ | All of $\mathcal{D}$ |
PaBPE avoids aggregation across all languages during merge selection, a key difference that drives parity in compression rates.
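The steps in the table can be combined into a self-contained toy training loop. This is a sketch under simplifying assumptions (character-level initialization, dev data identical to training data, illustrative names), not the released implementation:

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(pair[0] + pair[1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def parity_aware_bpe(texts_by_lang, num_merges):
    """Toy PaBPE loop: select each merge from the worst-compressed language,
    then apply it everywhere. texts_by_lang maps language -> list of strings."""
    seqs = {l: [list(t) for t in ts] for l, ts in texts_by_lang.items()}
    merges = []
    for _ in range(num_merges):
        # Step 1: per-language compression rates (tokens per char; higher = worse).
        rate = {l: sum(len(s) for s in ss) / sum(len(t) for t in texts_by_lang[l])
                for l, ss in seqs.items()}
        worst = max(rate, key=rate.get)
        # Step 2: count bigrams only in the worst-compressed language.
        counts = Counter()
        for s in seqs[worst]:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges.append(pair)
        # Step 3: apply the chosen merge to ALL languages.
        for l in seqs:
            seqs[l] = [merge_pair(s, pair) for s in seqs[l]]
    return merges
```

On a tiny two-language corpus this alternates attention between languages: once the first language is better compressed, the next merge is chosen from the other.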
5. Runtime, Compression, and Empirical Trade-Offs
- Computational Overhead: At each merge, only an additional pass over dev data is required to compute the per-language rates $c(T_t, \mathcal{D}_\ell)$. Overall asymptotic complexity remains unchanged versus classical BPE.
- Global Compression Loss: In a 30-language, 128k-vocabulary BrickRed setup, classical BPE achieves a global compression rate of 0.0303 tokens/char; PaBPE's rate of 0.0300 corresponds to roughly a 1% relative change in global compression.
- Downstream Model Performance: Across 13 multilingual benchmarks, PaBPE and hybrid variants match or exceed classical performance, with a median per-language accuracy change that is negligible (within standard error) and no significant performance drops observed. Per-language perplexities are as uniform, and as mean-matched, as under classical BPE.
6. Fairness Metrics and Experimental Results
Intrinsic parity measures applied on the parallel development set include:
- Per-language Compression Rate $c(T, \mathcal{D}_\ell)$
- Fertility: Tokens per word
- Vocabulary Utilization: Fraction of vocabulary used per language
- MorphScore: Morpheme alignment precision/recall
- Fairness Gini Coefficient:

$$G = \frac{1}{L}\left(L + 1 - \frac{2\sum_{\ell=1}^{L}(L + 1 - \ell)\,x_\ell}{\sum_{\ell=1}^{L} x_\ell}\right),$$

where $x_1 \le x_2 \le \dots \le x_L$ are the sorted per-language average tokens/line.
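The Gini computation reads directly off the standard weighted-sum form; a small Python helper (a sketch, with `values` as the per-language average tokens/line):

```python
def gini(values):
    """Gini coefficient of per-language tokenization cost (0 = perfect parity)."""
    xs = sorted(values)  # ascending: x_1 <= ... <= x_L
    n = len(xs)
    total = sum(xs)
    # G = (n + 1 - 2 * sum_{i=1..n} (n + 1 - i) * x_i / total) / n
    weighted = sum((n + 1 - i) * x for i, x in enumerate(xs, start=1))
    return (n + 1 - 2 * weighted / total) / n
```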
Key Results (BrickRed, 30 languages, 128k merges):
| Tokenizer | Compression Rate | Gini | Fertility |
|---|---|---|---|
| Classical | 0.0303 | 0.064 | 4.260 |
| Parity-aware | 0.0300 | 0.011 | 4.204 |
| Hybrid | 0.0303 | 0.018 | 4.191 |
| Window+Hybrid | 0.0305 | 0.022 | 4.203 |
Parity-aware methods substantially reduce the Gini coefficient, indicating decreased disparity in per-language tokenization cost, with only minimal reduction in global compression and fertility. Vocabulary utilization analyses (Fig. 2) show reduced gaps for low/medium-resource languages without harming high-resource ones. Downstream results further confirm parity with classical baselines.
7. Broader Implications and Prospects for Extension
PaBPE represents a “drop-in” modification to classical BPE, imparting:
- Significant reduction in cross-lingual compression gaps (Gini from 0.064 to 0.011).
- Less than 1% reduction in global compression rates.
- Negligible computational overhead per merge (a single extra pass over the dev data).
- No negative effect on downstream LLM performance.
Potential broader impacts include reducing the hidden “token tax” on under-resourced languages, facilitating equitable billing, and improving deployment fairness. Extensions to related subword tokenization families (WordPiece, UnigramLM) and other modalities (speech, vision) are viable directions. Further research could explore additional fairness objectives, such as semantic parity in representation.
Code for PaBPE is publicly available at https://github.com/swiss-ai/parity-aware-bpe.