
MrT5: Dynamic Token Merging Framework

Updated 9 February 2026
  • The paper's main contribution is a dynamic token merging framework that introduces a learned deletion gate to reduce sequence length efficiently.
  • It employs a soft-to-hard deletion mechanism that minimizes computational overhead while maintaining contextual accuracy across various languages.
  • Empirical results demonstrate significant runtime improvements and robust cross-lingual performance with minimal accuracy loss on tasks like spelling correction and XNLI.

MrT5, or MergeT5, is a dynamic token merging framework designed to address efficiency bottlenecks in byte-level language modeling. Building upon the ByT5 architecture, MrT5 introduces a learned deletion mechanism that enables the model to shorten input sequence lengths within its encoder, thereby reducing computational complexity while maintaining competitive performance. This architecture natively adapts to cross-lingual and orthographic variations, offering a practical solution to the longstanding trade-offs between subword and byte-level representations in neural sequence models (Kallini et al., 2024).

1. Motivation and Background

Traditional LLMs such as T5 and its multilingual variant mT5 employ subword tokenization, which, while computationally efficient, suffers from sensitivity to character-level noise (e.g., misspellings, varied script forms) and inconsistent token compression rates across languages. Byte-level approaches, exemplified by ByT5, circumvent these issues by operating directly on raw byte sequences (vocabulary size 256). However, this leads to significantly increased sequence lengths (up to 1024 tokens), causing substantial overhead in both pre-training (≈33% slower) and inference (up to 10× slower on long inputs) due to the quadratic scaling of self-attention with sequence length.

Fixed downsampling architectures (e.g., CANINE, Charformer) introduce their own limitations, such as the loss of variable-span semantic information or necessitating architectural changes. MrT5’s innovation is the introduction of a dynamic, content-adaptive token deletion gate that operates with minimal intrusion into the original Transformer encoder structure, merging less informative or redundant bytes early in the processing pipeline.

2. Architectural Design and Token Deletion Mechanism

MrT5 is built atop the ByT5-Small backbone, which uses a 12-layer encoder and a 4-layer decoder, model dimension $d_{model} = 1472$, and feed-forward dimension $d_{ff} = 3584$. The learned token deletion mechanism is introduced at encoder layer $\ell = 3$:

  • The first $\ell$ layers operate on the full sequence of $N$ tokens.
  • After layer $\ell$, a subset of tokens is retained according to a parametrized gating function, reducing the sequence to $N' = N \cdot (1-\delta)$, where $\delta$ is the fraction of tokens deleted.

During training, “soft deletion” is implemented; masked tokens remain in the model but are prevented from contributing through their attention weights. At inference time, “hard deletion” physically removes columns from the hidden states and prunes associated relative-position bias entries.

3. Learned Deletion Gate Formalism

Let $H_\ell \in \mathbb{R}^{N \times d_{model}}$ denote the output hidden states at layer $\ell$. The deletion gate produces a score vector $G \in (k, 0)^N$ as

$$G = k \cdot \sigma(H_\ell W + \mathbf{1}_N b)$$

where $W \in \mathbb{R}^{d_{model} \times 1}$ is a learned projection, $b \in \mathbb{R}$ is a bias, $\mathbf{1}_N$ is an all-ones vector, $k$ is a large negative constant ($k = -30$), and $\sigma$ is the sigmoid function. This yields $G_i \approx 0$ for tokens to be kept and $G_i \approx k$ for tokens to be deleted.
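As a minimal NumPy sketch of this gate (function and variable names are illustrative, not from the MrT5 codebase):

```python
import numpy as np

def deletion_gate(H, W, b, k=-30.0):
    """Per-token deletion scores G = k * sigmoid(H @ W + 1_N * b).

    H: (N, d_model) hidden states after encoder layer l.
    W: (d_model, 1) learned projection; b: scalar bias.
    Each score lies in (k, 0): ~0 keeps the token, ~k marks it for deletion.
    """
    logits = (H @ W).squeeze(-1) + b          # (N,)
    return k * (1.0 / (1.0 + np.exp(-logits)))  # k * sigmoid(logits)
```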

During training, $G$ is incorporated as an additive mask on the self-attention logits of subsequent layers:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}_1\!\left(\frac{QK^\top}{\sqrt{d_{model}}} + \mathbf{1}_N G^\top\right)V$$

where $\mathrm{softmax}_1(x)_i = \exp(x_i) / \bigl(1 + \sum_j \exp(x_j)\bigr)$. During inference, positions with $G_i < \tau$ (with $\tau = k/2 = -15$) are physically removed, and relative-position encodings are pruned accordingly.
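A single-head, unbatched sketch of this masked attention (names are illustrative). It also demonstrates why soft and hard deletion agree: driving one token's gate score very negative produces the same output as physically removing its column.

```python
import numpy as np

def softmax1(x):
    """softmax_1 over the last axis: exp(x_i) / (1 + sum_j exp(x_j)).
    The extra 1 acts as an implicit zero logit, so a row may attend to
    'nothing' when every visible logit is very negative."""
    m = np.maximum(np.max(x, axis=-1, keepdims=True), 0.0)  # stability shift
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=-1, keepdims=True))

def gated_attention(Q, K, V, G, d_model):
    """Self-attention with gate scores G broadcast over every query row."""
    logits = Q @ K.T / np.sqrt(d_model) + G[None, :]
    return softmax1(logits) @ V
```

Because the deleted token's attention weight underflows to zero under softmax_1, pruning its key/value columns at inference leaves the output unchanged.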

Unlike pooling or averaging, information from deleted tokens is merged implicitly: tokens to be deleted contribute contextually in layers before deletion, and their influence is funneled to the surviving tokens via the preceding multi-head attention computations.

4. Training Objective, Compression Trade-offs, and Hyperparameters

The MrT5 pre-training objective is the standard T5 span-corruption cross-entropy loss $L_{CE}$, in which 15% of bytes are masked and the model predicts the masked content. An additional regularization term $L_G = (1/N)\sum_{i=1}^N G_i$ encourages higher deletion rates ($G_i$ is negative, so minimizing $L_G$ pushes more scores toward $k$). The total loss is

$$L = L_{CE} + \alpha L_G$$

with $\alpha$ weighting the deletion regularizer.
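The combined objective can be sketched as follows, with `ce_loss` standing in for the span-corruption term (a hypothetical helper, not the reference implementation):

```python
import numpy as np

def mrt5_loss(ce_loss, G, alpha):
    """L = L_CE + alpha * L_G, with L_G = mean(G).

    Since each G_i lies in (k, 0), a positive alpha rewards pushing
    scores toward k, i.e. deleting more tokens."""
    return ce_loss + alpha * float(np.mean(G))
```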

A proportional (P-) controller can optionally adapt $\alpha$ during training to reach a desired deletion fraction $\hat{\delta}$: $\alpha_{t+1} = \mathrm{clamp}\bigl(\alpha_t + k_p(\hat{\delta} - \delta_t)\bigr)$, with $k_p = 10^{-6}$ and the clamp ensuring non-negativity. This mechanism enables explicit targeting of sequence-reduction goals.
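One controller step might look like the following sketch (names are illustrative):

```python
def update_alpha(alpha, delta_target, delta_observed, k_p=1e-6):
    """P-controller step: alpha_{t+1} = clamp(alpha_t + k_p * (target - observed)).

    Deleting too little (observed < target) raises alpha, increasing
    deletion pressure; the clamp keeps alpha non-negative."""
    return max(0.0, alpha + k_p * (delta_target - delta_observed))
```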

Critical hyperparameters include:

  • Delete-gate layer ($\ell = 3$): chosen via ablation; deleting earlier causes excessive information loss, deleting later yields smaller efficiency gains.
  • Gate constant $k = -30$, deletion threshold $\tau = -15$.
  • Deletion-regularizer weight $\alpha \in [0.005, 0.015]$ (monolingual), $\alpha = 0.012$ (multilingual).
  • Batch size of $2^{20}$ tokens for pre-training, AdamW optimizer, learning rate $10^{-4}$.
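Collected as a configuration sketch (the dictionary and its keys are illustrative, not from the released code):

```python
# Hypothetical configuration collecting the hyperparameters listed above.
MRT5_SMALL_CONFIG = {
    "gate_layer": 3,                      # encoder layer after which deletion occurs
    "gate_constant_k": -30.0,
    "deletion_threshold_tau": -15.0,      # tau = k / 2
    "alpha_monolingual": (0.005, 0.015),  # tuned range
    "alpha_multilingual": 0.012,
    "pretrain_batch_tokens": 2 ** 20,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
}
```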

5. Empirical Performance Analysis

MrT5 achieves significant improvements in sequence efficiency with minor or negligible accuracy degradation across a range of tasks and languages:

| Model / Task | Accuracy (%) | Runtime (ms) | Sequence Reduction (%) |
|---|---|---|---|
| ByT5 (English C4, δ=0) | — | 56.27 | 0 |
| MrT5 (English C4, δ≈57%) | — | 40.78 | 57 |
| ByT5 (XNLI, English) | 76.47 | 8.95 | 0 |
| MrT5 (XNLI, English) | 78.88 | 5.55 | 52.6 |
| ByT5 (Spelling Correction) | 58.19 | 3.25 | 0 |
| MrT5 (Spelling Correction) | 56.07 | 2.18 | 78.9 |

  • Monolingual continued pre-training on English: with $\delta \approx 57\%$, inference runtime drops by 27.5% with minimal effect on bits-per-byte.
  • Cross-lingual (XNLI): zero-shot transfer to 14 languages, with average runtime and sequence-length reductions exceeding 40%.
  • Character-level tasks: spelling correction and word search with up to 78.9% sequence reduction, showing MrT5's practicality for noisy-input tasks.
  • Multilingual adaptation: a single gate learns script-specific deletion patterns (e.g., dropping spaces in Chinese for optimal compression), reaching ≈60–70% sequence reduction across 15 languages with a loss gap to ByT5 of $\Delta \leq 0.01$.

Diagnostic copy tasks demonstrate that MrT5 learns semantically meaningful deletion strategies (such as removing vowels or merging contiguous runs), further supporting the effectiveness of the learned deletion paradigm.

6. Cross-lingual Generalization and Language-specific Compression

All gate parameters are shared across languages, allowing MrT5 models trained on English to transfer deletion strategies to other Latin-script languages in a zero-shot fashion. For scripts dissimilar to those encountered in training (e.g., Chinese), language-specific continued pre-training enables adaptation of deletion behavior, leading to uniform compression rates. In the multilingual training regime, MrT5 achieves consistent sequence compression (≈63%) for both alphabetic and logographic languages, highlighting the delete gate's flexibility and the model’s robustness across diverse orthographies.

7. Limitations and Future Directions

MrT5 is limited by the need to select the deletion layer $\ell$ and tune the gate regularizer $\alpha$ for each use case. Deleting too early can discard essential context, while deleting too late yields smaller compute savings. The model's implicit merging mechanism relies on the attention capacity of the encoder; it does not provide explicit control or interpretability over how information from deleted tokens is aggregated. Additionally, the evaluation is restricted to parameter-efficient regimes on ByT5-size models; scaling to larger architectures and generalization to other modalities (speech, vision) remains to be explored.

A plausible implication is that further architectural refinement, such as adaptive multi-layer merging or per-token deletion scheduling, could enhance both efficiency and language/task adaptation. The soft-to-hard deletion objective leaves open questions concerning the information bottleneck’s effect on deeper layers’ representation learning. MrT5’s paradigm provides a basis for the broader integration of dynamic sequence length optimization mechanisms into LLM architectures (Kallini et al., 2024).

References

Kallini, J., Murty, S., Manning, C. D., Potts, C., & Csordás, R. (2024). MrT5: Dynamic Token Merging for Efficient Byte-level Language Models.
