Grapheme Pair Encoding (GPE) Overview
- Grapheme Pair Encoding is a subword tokenization strategy that uses Unicode grapheme clusters as atomic units to better represent complex scripts.
- It refines the traditional BPE algorithm by first extracting linguistically motivated graphemes and then iteratively merging the most frequent adjacent pairs.
- Experimental comparisons reveal that GPE yields higher compression ratios, more compact token sequences, and improved tokenization parity for abugida languages like Tamil, Sinhala, and Hindi.
Grapheme Pair Encoding (GPE) is a subword tokenization strategy designed to enhance the representation of complex script languages within LLMs by aligning token boundaries with human-perceived characters ("graphemes"), rather than arbitrary byte or codepoint boundaries. This approach seeks to address inefficiencies inherent to existing tokenization schemes—particularly in morphologically rich or abugida scripts—by incorporating a linguistically motivated atomic unit into the classical Byte Pair Encoding (BPE) framework (Velayuthan et al., 2024).
1. Algorithmic Foundation of GPE
Grapheme Pair Encoding (GPE) is explicitly formulated as an augmentation of the standard Byte Pair Encoding (BPE) algorithm. In classical BPE, the tokenization procedure iteratively merges the most frequent adjacent pair in a sequence of atomic units, where the atom is typically a byte or Unicode codepoint. GPE modifies the initialization step: the atomic units are grapheme clusters, which are (typically multi-codepoint) sequences corresponding to human-perceived characters in the relevant writing system.
Formally, given a training corpus and a pre-tokenization rule, GPE consists of the following two phases:
- Phase 1 (Grapheme Extraction):
Iterate over all pre-tokens (obtained by applying the pre-tokenization rule) and decompose each into Unicode grapheme clusters (e.g., via the Python grapheme library). Aggregate all unique graphemes to seed the vocabulary.
- Phase 2 (Iterative Pair Merging):
Apply the standard BPE loop: identify the most frequent adjacent grapheme pairs, merge them into new symbols, and continue until the desired vocabulary size is reached or no further merges are possible.
At each merge iteration, the adjacent pair with the highest total occurrence count across the corpus is selected and merged, preserving the BPE optimization criterion but at grapheme-level granularity.
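The two-phase procedure can be sketched end-to-end as a minimal Python implementation. The combining-mark segmenter below is a simplifying stand-in for full UAX #29 grapheme clustering (which the Python grapheme library provides), and whitespace splitting stands in for the pre-tokenization rule:

```python
import unicodedata
from collections import Counter

def graphemes(text):
    # Simplified cluster segmentation: fold combining marks (Unicode
    # categories Mn/Mc/Me) into the preceding base character. A full
    # implementation would follow UAX #29, e.g. via the Python
    # `grapheme` package mentioned above.
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def train_gpe(corpus, num_merges):
    # Phase 1: decompose whitespace pre-tokens into grapheme clusters.
    words = Counter(tuple(graphemes(w)) for line in corpus for w in line.split())
    merges = []
    # Phase 2: the standard BPE loop, now over grapheme sequences.
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq, freq in words.items():
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # no adjacent pairs left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        rewritten = Counter()
        for seq, freq in words.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges
```

For scripts without combining marks this reduces exactly to character-level BPE; the difference appears only where a human-perceived character spans multiple codepoints.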
2. Pre-tokenization and Grapheme Extraction Procedures
Pre-tokenization in GPE, as with other subword approaches, involves splitting raw text using a regular expression (often whitespace or punctuation). For each resulting pre-token, grapheme segmentation is applied to produce a sequence of grapheme clusters.
Example (mixed Tamil-Latin string "வரControlled!"):
- Pre-tokenize: ["வரControlled", "!"]
- Grapheme clustering:
- "வரControlled" → ["வ", "ர", "C", "o", "n", "t", "r", "o", "l", "l", "e", "d"]
- "!" → ["!"]
- Each extracted cluster is added to the initial grapheme set.
- After the full corpus is processed, the main GPE loop (pair merging) operates over these grapheme sequences.
This process directly exposes script-specific, linguistically motivated units to the subword vocabulary, a contrast to byte- or codepoint-level schemes which frequently fragment such units in complex scripts.
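The worked example above can be reproduced with a short sketch. The regex pre-tokenizer and the combining-mark segmenter (an approximation of full UAX #29 clustering, which the Python grapheme library handles properly) are simplifying assumptions:

```python
import re
import unicodedata

def grapheme_clusters(token):
    # Approximate segmentation: fold combining marks (Unicode category
    # M*) into the preceding base character; the paper uses the Python
    # `grapheme` library for full UAX #29 cluster handling.
    clusters = []
    for ch in token:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Pre-tokenize, cluster each pre-token, then aggregate the seed set.
# NOTE: \w in Python's `re` does not match combining marks, so this toy
# regex is only safe for inputs without them; plain whitespace splitting
# (as standardized in the paper's experiments) avoids the issue.
pre_tokens = re.findall(r"\w+|[^\w\s]", "வரControlled!")
seed = set()
for tok in pre_tokens:
    seed.update(grapheme_clusters(tok))

print(pre_tokens)  # ['வரControlled', '!']
```

Over a full corpus, `seed` becomes the initial GPE vocabulary on which the pair-merging loop then operates.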
3. Comparison: Byte-level BPE versus Grapheme-based GPE
A central motivation for GPE is the mitigation of inefficiencies in existing BPE implementations when applied to complex scripts:
| Aspect | Byte-level BPE | Grapheme Pair Encoding |
|---|---|---|
| Atomic unit | Byte or codepoint | Grapheme cluster |
| Vocabulary seed | 256 bytes / all codepoints | All observed graphemes |
| Token granularity | Fragments complex graphemes | Aligns with linguistic characters |
| Language coverage | Universal, language-agnostic | Requires grapheme segmenter |
| Token sequence length | Long for multibyte scripts | Shorter for complex scripts |
| Preprocessing | Minimal | Segmentation + Unicode handling |
Byte-level approaches (e.g., ByT5, CANINE) offer maximal coverage but often at the expense of generating longer token sequences, particularly for scripts with multi-codepoint graphemes. In contrast, GPE yields more directly interpretable and compact tokenization for abugida and similar scripts, reducing token counts and aligning with user-perceived character boundaries. However, the efficacy of GPE is contingent upon correct Unicode grapheme segmentation.
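The sequence-length contrast in the table can be checked directly by counting each scheme's atomic units for a single Tamil word (the simplified combining-mark segmenter below approximates full UAX #29 clustering):

```python
import unicodedata

def grapheme_count(text):
    # Count clusters by folding combining marks (category M*) into the
    # preceding base character (an approximation of UAX #29).
    count = 0
    for ch in text:
        if count == 0 or not unicodedata.category(ch).startswith("M"):
            count += 1
    return count

word = "தமிழ்"  # "Tamil" written in Tamil script
print(len(word.encode("utf-8")))  # 15 atomic units for byte-level BPE (ByT5)
print(len(word))                  # 5 codepoints (CANINE's granularity)
print(grapheme_count(word))       # 3 grapheme clusters (GPE's granularity)
```

Each Tamil codepoint occupies three UTF-8 bytes, so a byte-level tokenizer starts from a sequence five times longer than the grapheme-level one before any merging happens.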
4. Experimental Setup and Quantitative Results
GPE was evaluated on three Indian abugida scripts (Tamil, Sinhala, Hindi) using 150,000 sentences sampled from Samanantar for training and FLORES+ dev sets for testing. All tokenizers were configured with a vocabulary size of 5,000 tokens, and pre-tokenization was standardized via whitespace splitting for direct comparison.
Key metrics:
- Compression Ratio (CR):
Characters (codepoints) per output token; higher CR indicates better compression.
- Tokenization Parity (TP) with English:
Token counts for parallel sentences relative to English; TP close to 1 denotes parity with English in token counts.
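A minimal sketch of both metrics, under two assumptions not spelled out above: CR is taken as codepoints per token (the direction consistent with the table, where codepoint-level CANINE sits near 1 and byte-level ByT5 near 1/3 for three-byte Tamil codepoints), and TP as the English-to-target token-count ratio:

```python
def compression_ratio(text, tokens):
    # CR = codepoints / tokens (assumed direction): a codepoint-level
    # tokenizer (CANINE) gives CR ~= 1, while UTF-8 bytes (ByT5) give
    # CR ~= 1/3 for Tamil, whose codepoints occupy three bytes each.
    return len(text) / len(tokens)

def tokenization_parity(tokens_lang, tokens_en):
    # TP (assumed convention) = |English tokens| / |target tokens|;
    # TP ~= 1 means the language needs no more tokens than English.
    return len(tokens_en) / len(tokens_lang)

word = "தமிழ்"
byte_tokens = [bytes([b]) for b in word.encode("utf-8")]
print(compression_ratio(word, byte_tokens))  # ~0.33, matching ByT5's row
```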
Empirical results yielded the following:
| Tokenizer Variant | Tamil CR | Notes |
|---|---|---|
| BPE | 4.32 | |
| Unigram | 4.31 | |
| WordPiece | 4.12 | |
| GPE | 4.36 | Highest CR among subword methods |
| CANINE (UTF-32) | 0.98–0.99 | Near-parity, little/no compression |
| ByT5 (UTF-8) | 0.37–0.39 | Very inefficient for multibyte |
| Grapheme extractor | 1.41–1.55 | Outperforms both ByT5 and CANINE |
Further analysis demonstrated:
- CR for vanilla subword algorithms with GPT-2 pre-tokenizer was ≈ 1.36; with whitespace pre-tokenizer, CR rose to 4.1–4.3.
- Multilingual EC pre-tokenizers achieved CR up to ≈ 9.2 on Tamil.
- Even without merging, grapheme extraction outperformed both byte-level and codepoint-level tokenizers.
This suggests that the choice of subword merging algorithm has limited impact; the granularity set by pre-tokenization and grapheme extraction is the primary determinant of tokenization efficiency.
5. Theoretical and Practical Trade-offs
GPE’s theoretical appeal lies in its linguistic alignment: it compacts text more efficiently by treating surface-level characters as indivisible units. Practically, this reduces token sequence lengths for complex scripts, yielding benefits in context-length handling and potential fairness in token-based billing models. However, these benefits depend on the quality and coverage of the grapheme segmentation; erroneous Unicode handling or unconventional script use can break the pipeline. GPE is also less applicable to languages with very large grapheme inventories (e.g., Chinese), and for abjad systems (e.g., Arabic, Hebrew) the treatment of diacritics remains an open question (Velayuthan et al., 2024).
6. Limitations and Directions for Future Work
The gains in compression ratio from GPE over BPE, with all else equal, are modest (absolute improvements ≈ 0.04 in CR for Tamil). Downstream impacts on model perplexity, translation quality, or task accuracy remain to be systematically established. Initial vocabulary size may be inflated in highly inflected or morphologically rich corpora. The requirement for robust Unicode-compliant grapheme cluster libraries adds implementation complexity and may introduce platform-specific coverage gaps.
Recommended future research includes:
- Extension of grapheme-based seeding to Unigram and WordPiece algorithms, coupled with downstream task evaluation (perplexity, BLEU, retrieval accuracy).
- Integration of morphological analyzers to enable morpheme-aware grapheme tokenization.
- Development of hybrid or adaptive vocabulary schemes (per language or domain).
- Quantitative analyses of GPE’s real-world impacts on model latency, memory footprint, and cross-linguistic model fairness, including token-billing parity evaluations (Velayuthan et al., 2024).
7. Significance in the Context of LLM Development
GPE constitutes a straightforward, BPE-compatible refinement that systematically improves tokenization parity and sequence compactness for complex scripts, especially abugidas such as Tamil, Sinhala, and Hindi. The results establish that pre-tokenizer and initial atomic unit selection are the principal design decisions governing tokenization efficiency; all further gains from merge algorithms are secondary. By foregrounding grapheme clusters as foundational units, GPE provides a valuable axis of egalitarian improvement in multilingual language modeling, particularly as LLM deployment expands beyond English-centric contexts (Velayuthan et al., 2024).