
SuperBPE: Space Travel for Language Models

Published 17 Mar 2025 in cs.CL and cs.LG | (2503.13423v3)

Abstract: The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "Raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying only the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.

Summary

  • The paper introduces SuperBPE, a novel tokenization strategy that learns ‘superword’ tokens to improve language model performance.
  • It employs a two-stage pretokenization curriculum that first learns traditional subword tokens and then multi-word tokens, encoding the same text with up to 33% fewer tokens than BPE.
  • Experiments on 30 downstream tasks show an average gain of 4.0 percentage points over standard BPE, together with lower inference compute.

SuperBPE: Space Travel for Language Models

The paper "SuperBPE: Space Travel for Language Models" proposes SuperBPE, a tokenization strategy that extends Byte Pair Encoding (BPE) with a pretokenization curriculum so that the tokenizer can learn "superword" tokens spanning multiple words. The goal is to improve LM performance and encoding efficiency without modifying the model architecture or the training framework.

Tokenization Beyond Subwords

Tokenizers segment text into the tokens that LMs operate on. BPE tokenizers have historically been restricted to subword units, because pretokenization on whitespace prevents merges across word boundaries. This convention overlooks multi-word expressions that function as single semantic units, and whitespace is not a usable delimiter in languages written without spaces, such as Chinese. SuperBPE extends tokenization to "superword" units that span multiple words using a two-stage pretokenization curriculum: it first learns subword tokens with whitespace splitting enabled, then disables that constraint so that superword tokens can be learned.

Figure 1: SuperBPE tokenizers encode text much more efficiently than BPE, and the advantage intensifies with larger vocabulary sizes.

Implementation of SuperBPE

SuperBPE begins exactly like traditional BPE, learning subword tokens with the same deterministic, greedy merge procedure under whitespace pretokenization. In its second phase, it lifts the restriction that blocks merges across whitespace, so that frequent multi-word expressions can be encoded as single tokens. Common word sequences thereby enter the vocabulary, which improves encoding efficiency.

The transition point, at which subword learning stops and superword learning begins, is a hyperparameter that affects both encoding efficiency and downstream performance. The paper examines it across several vocabulary sizes and transition settings.
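The two-stage procedure can be illustrated with a toy trainer. This is a minimal sketch under stated assumptions, not the authors' implementation: it works on characters rather than bytes, uses a simplified whitespace regex for pretokenization, and enters the second stage by simply concatenating the stage-1 chunks. The names `pretokenize`, `train_superbpe_sketch`, `num_merges`, and `transition` are hypothetical.

```python
import re
from collections import Counter


def pretokenize(text):
    # Assumed stage-1 rule: split into whitespace-delimited chunks,
    # each keeping its leading space ("by", " the", " way").
    return re.findall(r"\s?\S+|\s+", text)


def count_pairs(sequences):
    # Count adjacent symbol pairs inside each sequence (never across sequences).
    counts = Counter()
    for seq in sequences:
        for pair in zip(seq, seq[1:]):
            counts[pair] += 1
    return counts


def apply_merge(seq, pair):
    # Replace every non-overlapping occurrence of `pair` with the joined symbol.
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_superbpe_sketch(text, num_merges, transition):
    # Stage 1: one symbol sequence per whitespace chunk, so no merge can span words.
    sequences = [list(chunk) for chunk in pretokenize(text)]
    merges = []
    for step in range(num_merges):
        if step == transition:
            # Stage 2: drop the chunk boundaries; merges may now bridge whitespace.
            sequences = [[sym for seq in sequences for sym in seq]]
        counts = count_pairs(sequences)
        if not counts:
            break  # nothing left to merge
        # Greedy choice with a deterministic tie-break on the pair itself.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


corpus = " ".join(["by the way"] * 20)
for left, right in train_superbpe_sketch(corpus, num_merges=12, transition=6):
    print(repr(left + right))
```

Setting `transition` equal to `num_merges` reduces the sketch to ordinary whitespace-bounded BPE, which makes it easy to compare the two vocabularies on the same text.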

Experiments and Results

The study evaluates SuperBPE in several experiments that hold model size, vocabulary size, and training compute fixed in order to isolate the effect of tokenization. With a vocabulary size of 200,000, the SuperBPE-trained model improved on the traditional BPE baseline by 4.0 percentage points on average across 30 diverse downstream tasks. Because multi-word expressions are encoded as single tokens, the same text requires fewer tokens, which reduced inference compute by 27% to 33%.

Figure 2: Average performance on 30 downstream tasks, evaluated every 5000 steps, shows that SuperBPE consistently outperforms BPE.

Fine-Grained Analysis

SuperBPE and BPE models reach comparable bits-per-byte (BPB) across the experimental configurations, with small differences that depend on the task and the model configuration. At the level of individual tokens, SuperBPE yields a more uniform distribution of prediction difficulty, an effect described as more pronounced as model size and training data size increase.

Figure 3: Bits-per-byte of BPE and SuperBPE models during pretraining, showing slight variations in efficiency.
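BPB is used here because it normalizes loss by the byte length of the text rather than by the token count, so models with different tokenizers can be compared. The sketch below uses the standard definition (total negative log-likelihood converted to bits, divided by UTF-8 bytes); the function name `bits_per_byte` and the input values are hypothetical and not taken from the paper.

```python
import math


def bits_per_byte(token_nll_nats, text):
    # token_nll_nats: per-token negative log-likelihoods in nats, one per token.
    # Convert the summed loss to bits, then normalize by the UTF-8 byte length.
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / len(text.encode("utf-8"))


# Made-up losses for the same text under two segmentations with equal total loss:
# many easy tokens versus fewer, harder tokens.
text = "by the way"
many_tokens = [0.5] * 6   # 6 tokens, 3.0 nats in total
fewer_tokens = [1.5] * 2  # 2 tokens, 3.0 nats in total
print(bits_per_byte(many_tokens, text), bits_per_byte(fewer_tokens, text))
```

Both calls return the same value because only the total loss and the byte count enter the formula, which is why a tokenizer can shorten sequences without that alone changing BPB.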

Encoding Efficiency

SuperBPE improves encoding efficiency, measured in bytes per token, over traditional BPE, mainly because its vocabulary can capture common multi-word expressions. As depicted in Figure 1, SuperBPE's advantage grows with vocabulary size, leading to substantially shorter token sequences and correspondingly lower computational demands.
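As a concrete reading of the bytes-per-token metric, the sketch below compares two hand-written segmentations of the same sentence. The segmentations are hypothetical illustrations, not output from a real BPE or SuperBPE tokenizer, and `bytes_per_token` is a helper name introduced here.

```python
def bytes_per_token(text, tokens):
    # Encoding efficiency: UTF-8 bytes of text carried by each token on average.
    return len(text.encode("utf-8")) / len(tokens)


text = "by the way, thanks"

# Hypothetical subword-style segmentation (tokens stop at word boundaries).
subword_tokens = ["by", " the", " way", ",", " thanks"]
# Hypothetical superword-style segmentation (one token spans "by the way").
superword_tokens = ["by the way", ",", " thanks"]

assert "".join(subword_tokens) == text
assert "".join(superword_tokens) == text

print(bytes_per_token(text, subword_tokens))    # 3.6
print(bytes_per_token(text, superword_tokens))  # 6.0
```

The two views of efficiency are interchangeable: encoding a text with a fraction r fewer tokens multiplies bytes per token by 1 / (1 - r), so 33% fewer tokens corresponds to roughly 1.49 times as many bytes per token.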

Scaling Implications

The increased encoding efficiency may also matter for scaling. Because each token covers more text, a SuperBPE model processes the same data in fewer tokens, which could slow the onset of the diminishing returns described by transformer scaling laws. Smaller but more efficient SuperBPE models might therefore compete with significantly larger BPE models without an equivalent increase in computational resources.

Conclusion

SuperBPE represents an advancement in tokenization effectiveness for LLMs, offering operational efficiencies and performance improvements, particularly in handling languages and contexts traditionally challenging for tokenization. Going forward, the interplay between tokenization strategies and architectural optimization remains a compelling avenue for enhancing LLM capabilities. The seamless integration of SuperBPE with existing model architectures positions it as an accessible advancement in the field of natural language processing, promising broad applicability across various linguistic and computational contexts.


Explain it Like I'm 14

What is this paper about?

This paper asks a simple question with a big impact: Do computers really need to split text only at spaces when they “read” and “think”? The authors introduce SuperBPE, a new way to break text into pieces (called tokens) that lets a language model group common phrases like “by the way” or “Milky Way” as single pieces, not just single words or parts of words. This makes models faster, cheaper to run, and often smarter.

The big questions the authors asked

  • Can we make LLMs work better by letting tokens cross spaces (so one token can be several words), not just stop at word boundaries?
  • Will this “superword” tokenization make text shorter in tokens, saving compute time and memory?
  • If we keep the model size, training data, and training budget the same, does changing only the tokenizer improve real-world performance on many tasks?

How did they test their idea?

First, a quick primer:

  • Tokenization: Imagine you want to read a sentence but can only look at a few “chunks” at a time. A tokenizer decides where to cut the sentence into chunks (tokens). Traditional tokenizers mostly cut at spaces and also split long or rare words into smaller parts (subwords).
  • BPE (Byte-Pair Encoding): Think of building with LEGO. BPE starts with tiny bricks (bytes) and repeatedly snaps together the pairs of pieces that show up next to each other most often, creating bigger, more useful bricks (subwords). Classic BPE doesn’t let pieces join across spaces, so it won’t turn “by the way” into one brick.

What SuperBPE changes:

  • Two-stage learning (a simple curriculum):
      1. Stage 1: Learn normal subword tokens (no crossing spaces), just like standard BPE.
      2. Stage 2: Turn off the “no crossing spaces” rule, so the algorithm can also learn superword tokens that span multiple words (like “by_the_way”).
  • Why this helps: Many phrases act like single ideas. Grouping them reduces the total number of tokens and can make the model’s job easier and faster.

How they experimented:

  • They trained transformer language models (about 8 billion parameters) from scratch.
  • They kept everything the same—model size, training data mix, total training compute, and vocabulary size (200,000 tokens)—and changed only how the tokenizer learns its vocabulary (BPE vs. SuperBPE).
  • They tried different “transition points” (when to switch from Stage 1 to Stage 2), such as 80k, 160k, and 180k tokens learned.

An everyday analogy: Packing a suitcase. BPE packs each word (or word piece) separately. SuperBPE lets you bundle common outfits (multi-word phrases) into a single packing cube. You fit more with fewer items to manage.

What did they find? Why it matters

Shorter tokenized text (better “encoding efficiency”):

  • With the same 200k vocabulary size, SuperBPE used up to about 33% fewer tokens to represent the same text than regular BPE.
  • In simple terms, each token carries more text on average, so the model needs fewer steps to read or generate the same passage.

Faster and cheaper at inference (when you actually use the model):

  • Because there are fewer tokens, the model does less work per response.
  • In their 8B-model tests, SuperBPE cut inference compute by about 27%–33%, depending on the setup.
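A back-of-the-envelope sketch shows why fewer tokens means less inference compute. It assumes the common approximation that a forward pass costs about 2 FLOPs per parameter per token and ignores attention cost, which is a simplification and not the paper's accounting. The function names and token counts are hypothetical.

```python
def approx_inference_flops(n_params, n_tokens):
    # Rough rule of thumb (an assumption here): ~2 FLOPs per parameter per token.
    return 2 * n_params * n_tokens


def relative_savings(baseline_flops, new_flops):
    # Fraction of the baseline compute that is saved.
    return 1 - new_flops / baseline_flops


n_params = 8e9          # 8B-parameter model, as in the experiments
baseline_tokens = 1000  # hypothetical token count for a passage under BPE
superword_tokens = 730  # hypothetical count for the same passage, 27% fewer tokens

baseline = approx_inference_flops(n_params, baseline_tokens)
reduced = approx_inference_flops(n_params, superword_tokens)
print(f"{relative_savings(baseline, reduced):.0%} less compute")
```

Under this approximation the model size cancels out, so the compute saving for a fixed model equals the reduction in token count.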

Stronger performance on many tasks:

  • Across 30 evaluation tasks (covering knowledge, reasoning, reading, and more), the best SuperBPE model scored about +4.0 percentage points higher on average than the BPE baseline.
  • It won on 25 out of 30 tasks, including big gains on multiple-choice benchmarks (for example, +8.2 on MMLU).

Why it might help the model think better:

  • SuperBPE often creates tokens that are common, meaningful phrases (like “by accident,” “in the long run,” “for a living,” “by the way”).
  • The authors found SuperBPE spreads difficulty more evenly across tokens: fewer “too easy” and fewer “very hard” predictions. That may help models focus on the challenging parts that matter for real tasks.

A small caveat explained simply:

  • A metric called bits-per-byte (BPB) measures how hard the model’s predictions are, adjusted for token size. SuperBPE’s BPB was very close to BPE’s—and sometimes slightly worse—even though SuperBPE did better on tasks. Why? Because SuperBPE merges some extremely common words (like “the” or “of”) into longer tokens, removing some of the “easy wins.” At the same time, it reduces the worst-case mistakes, which likely boosts task accuracy where it counts.

What could this change in the real world?

  • Better models without changing architectures: SuperBPE is a small, drop-in tokenizer change. No need to redesign the model or training code.
  • Lower costs and energy use: Fewer tokens per input means faster, cheaper, and greener inference.
  • Smarter handling of language: Many languages don’t use spaces the way English does, and many ideas are phrases, not single words. SuperBPE naturally captures these, which can help multilingual and phrase-heavy text.
  • Practical takeaway: If you’re training or serving LLMs, switching to SuperBPE can yield both performance gains and efficiency savings.
  • Future directions: SuperBPE could be combined with other techniques (like predicting multiple tokens at once) for even bigger benefits.

In short, the paper shows that letting tokens cross spaces—so models can treat common phrases as single units—makes LLMs more efficient and often more capable, all with a simple change to how we split text.
