
Parity-Aware BPE for Fair Multilingual Tokenization

Updated 26 January 2026
  • Parity-aware BPE (PaBPE) is a tokenization method that modifies classical BPE by focusing on the worst-compressed language to achieve equitable compression across multilingual corpora.
  • It selects merges using language-specific bigram counts, thereby reducing tokenization inefficiencies and disparities especially in low-resource languages.
  • Empirical results demonstrate that PaBPE maintains high overall compression and model performance while significantly lowering cross-lingual fairness gaps.

Parity-aware Byte-Pair Encoding (PaBPE) is a modification of the classical Byte-Pair Encoding (BPE) algorithm designed to enhance cross-lingual fairness in tokenization workflows, particularly in multilingual natural language processing tasks. By iteratively maximizing the compression gain for the currently worst-compressed language during subword vocabulary construction, PaBPE mitigates the disparities in tokenization efficiency that arise when frequency-based tokenization strategies favor high-resource languages. PaBPE yields a more equitable distribution of tokenization costs and improved fairness metrics, with negligible impact on overall compression rates and downstream model performance (Foroutan et al., 6 Aug 2025).

1. Byte-Pair Encoding: Classical Foundations

Byte-Pair Encoding (BPE) is a greedy, frequency-driven algorithm for learning a fixed-size subword vocabulary that optimizes corpus-level compression. Given a corpus $D$, an initial alphabet $B$ (e.g., $|B| = 256$ for bytes), and $K$ target merges, the algorithm starts with $V_0 = B$ and maintains a tokenizer $T_k$ defined by merges $m_1, \ldots, m_{k-1}$. The average compression rate of $D$ under $T_k$ is:

$$CR(D; T_k) = \frac{1}{|D|} \sum_{x \in D} \frac{|x|_u}{|T_k(x)|}$$

where $|x|_u$ is the length of $x$ in bytes, characters, or another unit. At each step, classical BPE greedily chooses the pair $(v, v')$ with the largest bigram count in $D$ and merges it, maximizing the immediate corpus-level compression gain as a greedy approximation of the joint objective:

$$\mathbf{m}^* = \arg\max_{m_1, \dots, m_K} CR(D; T_K)$$

The resulting subword vocabulary and merge history enable tokenization that efficiently compresses text but can induce significant inequity across languages in multilingual corpora.
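
A minimal sketch of this greedy loop, assuming character-level initial tokens (helper names such as `compression_rate`, `apply_merge`, and `classical_bpe` are illustrative, not the reference implementation):

```python
from collections import Counter

def compression_rate(corpus_tokens, corpus_units):
    # CR(D; T): average over documents of raw length (bytes/chars) divided by token count.
    return sum(u / len(t) for t, u in zip(corpus_tokens, corpus_units)) / len(corpus_tokens)

def apply_merge(tokens, pair, new_symbol):
    # Replace every non-overlapping occurrence of `pair` with the merged symbol.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def classical_bpe(corpus_tokens, num_merges):
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs (bigrams) over the whole corpus D ...
        counts = Counter()
        for toks in corpus_tokens:
            counts.update(zip(toks, toks[1:]))
        if not counts:
            break
        # ... and greedily merge the most frequent pair.
        pair, _ = counts.most_common(1)[0]
        merges.append(pair)
        corpus_tokens = [apply_merge(t, pair, pair[0] + pair[1]) for t in corpus_tokens]
    return merges, corpus_tokens

# Toy usage: documents start as character sequences; units are their raw lengths.
docs = [list("banana"), list("bandana")]
units = [len("banana"), len("bandana")]
merges, tokenized = classical_bpe(docs, num_merges=3)
print(merges, compression_rate(tokenized, units))
```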

2. Motivation and Cross-Lingual Fairness Challenges

Typical frequency-based BPE objectives heavily privilege languages contributing the most tokens, disadvantaging low-resource languages in several ways:

  • Increased tokenization length: Higher fertility (tokens per word) for low-resource languages.
  • Morphological misalignment: Fragmented or implausible subwords in minority languages.
  • Placeholder proliferation: Elevated rates of <UNK> assignment due to poor segmentation.
  • Operational costs: Users of low-resource languages pay more per API token due to less efficient tokenization.

To address this, PaBPE introduces an explicit fairness objective, aiming for the per-language compression rates $CR(D_l; T)$ to be as equal as possible, thereby lowering the computational "tax" on under-resourced languages.

3. Mathematical Formulation of Parity-aware Objectives

3.1. Global and Per-Language Compression Gain

The global compression gain for a merge $m = (v, v')$ on corpus $D$ is:

$$G_{\mathrm{global}}(m) = CR(D; T \oplus m) - CR(D; T)$$

where $T \oplus m$ is the tokenizer after applying merge $m$.
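
As a sketch, the gain of a single candidate merge can be evaluated by retokenizing and re-measuring $CR$ (an illustrative helper, not the authors' code; `corpus_units` holds each document's raw byte/character length):

```python
def global_gain(corpus_tokens, corpus_units, pair):
    # G_global(m) = CR(D; T (+) m) - CR(D; T) for the candidate merge m = pair.
    def cr(docs):
        return sum(u / len(t) for t, u in zip(docs, corpus_units)) / len(docs)

    def merge_one(tokens):
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    return cr([merge_one(t) for t in corpus_tokens]) - cr(corpus_tokens)
```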

3.2. Identifying the Worst-Compressed Language

Partitioning the corpus by language into $D_{l_1}, \ldots, D_{l_R}$ for the language set $L = \{ l_1, \ldots, l_R \}$, the worst-off language at step $k$ is:

$$l^* = \arg\min_{l \in L} CR(D_l; T_{<k})$$
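
For instance, with hypothetical per-language compression rates (made-up values, not from the paper):

```python
# CR(D_l; T_{<k}) measured per language on held-out data (illustrative numbers).
per_language_cr = {"en": 4.8, "de": 4.1, "sw": 2.6, "fi": 3.3}

# l* = argmin_l CR(D_l; T_{<k}): the worst-compressed language drives the next merge.
worst = min(per_language_cr, key=per_language_cr.get)
print(worst)  # -> "sw"
```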

3.3. Parity-aware Min–Max Objective

The ideal (but intractable) objective would be:

$$\max_{m_1, \dots, m_K} \min_{l \in L} CR(D_l; T_K)$$

PaBPE greedily approximates this by, at each step, focusing solely on the worst-compressed language $l^*$. The selected merge is:

$$m_k = \arg\max_{(v,v')} \left|\{ \text{occurrences of } v v' \text{ in } D_{l^*} \}\right|$$

Only bigram statistics from $D_{l^*}$ inform the merge, and the same merge is applied corpus-wide, thus incrementally improving fairness.
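
A sketch of this selection rule, assuming `docs_worst` holds the token sequences of $D_{l^*}$ (the function name is illustrative):

```python
from collections import Counter

def select_parity_merge(docs_worst):
    # Count adjacent pairs (bigrams) only in D_{l*}, the worst-compressed language.
    counts = Counter()
    for toks in docs_worst:
        counts.update(zip(toks, toks[1:]))
    pair, _ = counts.most_common(1)[0]
    return pair  # the merge is subsequently applied to every language's corpus

# e.g. select_parity_merge([["l", "o", "w"], ["l", "o", "w", "e", "r"], ["l", "o"]]) -> ("l", "o")
```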

4. Algorithmic Procedures and Pseudocode

The distinction between classical and Parity-aware BPE centers on bigram counting and selection scope:

Classical BPE

  • Counts bigrams over the entire corpus and chooses the most frequent pair.

Parity-aware BPE

  • For each merge:
    • Computes per-language compression rates on dev data.
    • Identifies $l^*$, the language with the lowest compression rate.
    • Counts bigrams only in $D_{l^*}$.
    • Selects the most frequent bigram in $D_{l^*}$.
    • Applies the chosen merge to all languages.

The procedure is summarized in the following comparison:

| Step | Classical BPE | Parity-aware BPE |
| --- | --- | --- |
| Bigram counting | All of $D$ | $D_{l^*}$ only |
| Merge selection | Largest corpus-wide bigram count | Largest bigram count in $D_{l^*}$ |
| Merge application | All of $D$ | All of $D$ |

PaBPE avoids aggregation across all languages during merge selection, a key difference that drives parity in compression rates.
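
Putting the pieces together, a compact end-to-end sketch of the parity-aware training loop might look as follows (assumed data layout: `train_by_lang[l]` and `dev_by_lang[l]` are lists of token sequences for language $l$, and `dev_units_by_lang[l]` holds each dev sequence's raw byte/character length; this is an illustration, not the released implementation):

```python
from collections import Counter

def apply_merge(tokens, pair, new_symbol):
    # Replace non-overlapping occurrences of `pair` with the merged symbol.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def parity_aware_bpe(train_by_lang, dev_by_lang, dev_units_by_lang, num_merges):
    merges = []
    for _ in range(num_merges):
        # 1. Per-language compression rates on dev data (the extra O(|L|) pass per merge).
        cr = {
            l: sum(u / len(t) for t, u in zip(dev_by_lang[l], dev_units_by_lang[l]))
               / len(dev_by_lang[l])
            for l in dev_by_lang
        }
        # 2. Worst-compressed language l*.
        worst = min(cr, key=cr.get)
        # 3. Bigram counts only in D_{l*}.
        counts = Counter()
        for toks in train_by_lang[worst]:
            counts.update(zip(toks, toks[1:]))
        if not counts:
            break
        # 4. Most frequent bigram in D_{l*} becomes the next merge.
        pair, _ = counts.most_common(1)[0]
        merges.append(pair)
        # 5. Apply the chosen merge to all languages, train and dev splits alike.
        new_symbol = pair[0] + pair[1]
        for split in (train_by_lang, dev_by_lang):
            for l in split:
                split[l] = [apply_merge(t, pair, new_symbol) for t in split[l]]
    return merges
```

As the comparison table above indicates, only the counting and selection scope differs from classical BPE; merge application is identical.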

5. Runtime, Compression, and Empirical Trade-Offs

  • Computational Overhead: At each merge, only an additional $O(|L|)$ pass over dev data is required to compute the $CR_l$ values. Overall asymptotic complexity remains unchanged versus classical BPE.
  • Global Compression Loss: In a 30-language, 128k-vocabulary setup, classical BPE achieves $CR = 0.0303$ tokens/char; PaBPE slightly decreases global compression to $CR \approx 0.0300$, corresponding to a roughly 1% relative drop.
  • Downstream Model Performance: Across 13 multilingual benchmarks, PaBPE and hybrid variants match or exceed classical performance, with a median per-language accuracy change of $+0.19$ percentage points ($\pm 1$ standard error) and no significant performance drops. Per-language perplexities are as uniform as under classical BPE, with matched means.

6. Fairness Metrics and Experimental Results

Intrinsic parity measures applied on the parallel development set include:

  • Per-language Compression Rate ($CR_l$)
  • Fertility: Tokens per word
  • Vocabulary Utilization: Fraction of vocabulary used per language
  • MorphScore: Morpheme alignment precision/recall
  • Fairness Gini Coefficient:

$$\mathrm{Gini} = \frac{1}{R} \left(R + 1 - 2\, \frac{\sum_{i=1}^R (R+1-i)\, c_i}{\sum_i c_i}\right)$$

where $c_i$ are the per-language average tokens per line, sorted in ascending order.
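
A small sketch of this computation (the `fairness_gini` name and the sample costs are illustrative, not from the paper's evaluation code):

```python
def fairness_gini(costs):
    # costs: per-language average tokens/line; sort ascending so that c_1 <= ... <= c_R.
    c = sorted(costs)
    R = len(c)
    weighted = sum((R + 1 - i) * ci for i, ci in enumerate(c, start=1))
    return (R + 1 - 2 * weighted / sum(c)) / R

print(fairness_gini([4.0, 4.0, 4.0]))  # 0.0  -- perfectly equal per-language cost
print(fairness_gini([3.0, 4.0, 9.0]))  # 0.25 -- higher disparity, higher Gini
```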

Key Results (30 languages, 128k merges):

| Tokenizer | Compression Rate | Gini | Fertility |
| --- | --- | --- | --- |
| Classical | 0.0303 | 0.064 | 4.260 |
| Parity-aware | 0.0300 | 0.011 | 4.204 |
| Hybrid | 0.0303 | 0.018 | 4.191 |
| Window+Hybrid | 0.0305 | 0.022 | 4.203 |

Parity-aware methods substantially reduce the Gini coefficient, indicating decreased disparity in per-language tokenization cost, with only minimal reduction in global compression and fertility. Vocabulary utilization analyses (Fig. 2) show reduced gaps for low/medium-resource languages without harming high-resource ones. Downstream results further confirm parity with classical baselines.

7. Broader Implications and Prospects for Extension

PaBPE represents a “drop-in” modification to classical BPE, offering:

  • Significant reduction in cross-lingual compression gaps (Gini from 0.064 to 0.011).
  • Less than 1% reduction in global compression rates.
  • Negligible computational overhead per merge ($O(|L|)$).
  • No negative effect on downstream LLM performance.

Potential broader impacts include reducing the hidden “token tax” on under-resourced languages, facilitating equitable billing, and improving deployment fairness. Extensions to related subword tokenization families (WordPiece, UnigramLM) and other modalities (speech, vision) are viable directions. Further research could explore additional fairness objectives, such as semantic parity in representation.

Code for PaBPE is publicly available at https://github.com/swiss-ai/parity-aware-bpe.
