
Dynamic Tokenization Strategies

Updated 26 January 2026
  • Dynamic tokenization strategies are adaptive frameworks that determine token boundaries and granularity based on data content, task requirements, or learned parameters.
  • They enhance model performance by mitigating limitations of static tokenization, reducing over-fragmentation, and aligning processing with downstream objectives.
  • These techniques span modalities such as language, vision, genomics, and time series, utilizing learnable predictors, adaptive masking, and on-the-fly merging.

Dynamic tokenization strategies are computational frameworks and learning algorithms that allow neural models to determine token boundaries, granularity, and encodings adaptively, at training time, at inference time, or both. These strategies have emerged as key components across modalities—including language, vision, biological sequence modeling, time series, electronic health records, and graph data—where fixed, static tokenization leads to suboptimal trade-offs between modeling capacity, efficiency, and generalization. Recent research establishes that dynamic tokenization not only improves downstream performance, but also mitigates the representational and reasoning bottlenecks introduced by traditional static vocabularies and segmentation schemes.

1. Principles and Taxonomy of Dynamic Tokenization

Dynamic tokenization encompasses a family of techniques for segmenting input data into tokens of variable granularity or sequence length, with token boundaries and/or compression rates determined by data content, task requirements, or learned model parameters. Key dimensions include:

  • Boundary prediction and variable-length segmentation: Token boundaries predicted by a learnable module, typically with differentiable relaxations for gradient propagation (Owodunni et al., 17 Jul 2025).
  • On-the-fly subword merging or splitting: Tokenization decisions taken per batch, sequence, or position, often using local heuristics (e.g., adjacency statistics) or neural parameterizations (Feher et al., 2024, Li et al., 17 Nov 2025, Jin et al., 2023).
  • Adaptive masking, pruning, or merging for efficiency: Sequence length reduction via content-aware dropout or merging of redundant tokens (Yan et al., 2024, Havtorn et al., 2023, Li et al., 17 Nov 2025).
  • Joint optimization with downstream tasks: Integrated training of the tokenizer and the main model to align tokenization with downstream loss signals (Hiraoka et al., 2021, Jin et al., 2023).
  • Multimodal and structure-aware tokenization: Domain-specific frameworks that exploit data structure, temporal signals, or multimodal cues to guide tokenization (dynamic blockwise for video, event-based for EHR, patches for dynamic graphs) (Yan et al., 2024, Ma et al., 2024, Biparva et al., 2024).

By allowing tokenization to adapt in response to observed data complexity and modeling objectives, dynamic schemes move beyond the “pipeline” paradigm in which tokenization is a fixed preprocessing step divorced from downstream task optimality.

2. Methodologies Across Modalities

Language Modeling

  • FLEXITOKENS (Owodunni et al., 17 Jul 2025): A byte-level LLM with a learnable boundary predictor, implemented as a small MLP over byte embeddings. Boundary decisions are made per position via the Hard Gumbel-Sigmoid trick, permitting variable-length segment emission while maintaining differentiability. Training introduces a hinge-style boundary loss to enable flexible adaptation of compression rates, eschewing the rigidity of binomial losses used in prior works.
  • Retrofitting LMs with Dynamic Tokenization (Feher et al., 2024): Batch-level subword merging starting from static BPE, adaptively applying the m most frequent pair merges per batch. Newly formed tokens receive embeddings from a pretrained hypernetwork (ZeTT), thus obviating the fixed vocabulary.
  • Task-guided dynamic selection: For morphologically rich languages (e.g., Korean), hybrid tokenization (morphological segmentation followed by BPE) is dynamically selected for classification and translation tasks, while coarse-grained BPE alone is preferred for span-centric tasks (MRC) (Park et al., 2020).
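The batch-level subword merging described above can be sketched as follows. This is a minimal illustration, assuming whitespace-free subword sequences; it is not the retrofitting pipeline itself, and it omits the embedding hypernetwork (ZeTT) that supplies vectors for newly formed tokens:

```python
from collections import Counter

def merge_most_frequent_pairs(batch, m):
    """Greedily apply the m most frequent adjacent-pair merges in a batch.

    batch: list of token sequences (lists of strings), already produced
    by a static subword tokenizer such as BPE. Each merge joins two
    adjacent tokens into one longer token, shortening every sequence
    in which the pair occurs.
    """
    for _ in range(m):
        # Count adjacent pairs across the whole batch.
        pairs = Counter(
            (seq[i], seq[i + 1]) for seq in batch for i in range(len(seq) - 1)
        )
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged_tok = a + b
        # Replace every occurrence of the chosen pair with the merged token.
        new_batch = []
        for seq in batch:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged_tok)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_batch.append(out)
        batch = new_batch
    return batch

batch = [["to", "ken", "i", "za", "tion"], ["to", "ken", "s"]]
merged = merge_most_frequent_pairs(batch, m=2)
```

Because the merges are recomputed per batch, the effective vocabulary adapts to the input distribution rather than staying fixed at pretraining time.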

Vision and Multimodal Models

  • ElasticTok (Yan et al., 2024): For vision models, the number of latent tokens emitted by an autoencoder per image or video block is randomized and adaptively masked during training. At inference, token count is selected per block based on reconstruction error, yielding variable-length encodings tightly matched to data complexity.
  • MSViT (Havtorn et al., 2023): A gating module predicts per-region granularity, enabling mixed-scale token sets across image regions. The GBaS regularizer ensures both global control of token budget and local nontriviality in token assignment.
  • Dynamic Discrete Visual Tokenization (Jin et al., 2023): A Gumbel-Softmax-based selector learns to retain or merge image patch embeddings, dynamically controlling attention cost within a vision-language generative LLM.
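ElasticTok-style inference—choosing the smallest token count whose reconstruction error meets a tolerance—can be sketched with toy stand-ins for the encoder/decoder pair. The `reconstruct` and `mse` helpers below are illustrative, not the paper's models:

```python
def min_tokens_for_budget(block, reconstruct, error, max_tokens, tol):
    """Smallest token count whose reconstruction error is within tol.

    Binary search is valid because the error is monotonically
    non-increasing in the number of tokens kept.
    """
    lo, hi = 1, max_tokens
    while lo < hi:
        mid = (lo + hi) // 2
        if error(block, reconstruct(block, mid)) <= tol:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Toy stand-ins: keeping k "tokens" keeps the first k values of the
# block and zeroes out the rest (a crude prefix-truncation latent).
def reconstruct(block, k):
    return block[:k] + [0.0] * (len(block) - k)

def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

block = [3.0, 2.0, 1.0, 0.1, 0.1]   # most energy in the first values
k = min_tokens_for_budget(block, reconstruct, mse, max_tokens=5, tol=0.1)
```

Complex blocks force a larger k, simple blocks allow a small one, which is exactly the variable-length encoding behavior described above.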

Genomics and Sequences

  • MergeDNA (Li et al., 17 Nov 2025): Differentiable local-window token merging layers recurrently combine adjacent base tokens in DNA, driven by a learnable grouping function and local attention. Compression ratios are varied at each layer, producing variable-length, context-aware “words” that adapt to regional information density.
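One hard (non-differentiable) merging step of this kind can be sketched as follows. The pair scores are given explicitly here, whereas MergeDNA learns them with local attention over a context window:

```python
def merge_adjacent(tokens, scores, n_merges):
    """Merge the n_merges highest-scoring non-overlapping adjacent pairs.

    scores[i] rates the pair (tokens[i], tokens[i+1]). Pairs are taken
    greedily in score order, skipping any pair that overlaps an
    already-chosen one, then the sequence is rebuilt left to right.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    used, chosen = set(), set()
    for i in order:
        if len(chosen) == n_merges:
            break
        if i not in used and i + 1 not in used:
            chosen.add(i)
            used.update((i, i + 1))
    out, i = [], 0
    while i < len(tokens):
        if i in chosen:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

bases = ["A", "T", "G", "C", "A", "T"]
# Hypothetical learned pair scores for each adjacent pair of bases.
pair_scores = [0.9, 0.1, 0.2, 0.1, 0.8]
words = merge_adjacent(bases, pair_scores, n_merges=2)
```

Stacking several such layers with different merge ratios yields progressively coarser, context-aware "words", as described above.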

Time Series, EHR, and Graphs

  • EHR dynamic tokenization (Ma et al., 2024): Each event (timestamp, variable, value) is mapped into a token using learnable encoders for timing, absolute/relative position, and variable identity. Tokenization reflects the native, irregular timing of real-world event data rather than imposed binning.
  • Todyformer for dynamic graphs (Biparva et al., 2024): Continuous-time dynamic graphs are patchified into temporal windows, within which a structure-aware MPNN learns token representations. This dynamic patch-level tokenization alleviates over-squashing and over-smoothing, and the token set evolves as new edges/nodes arrive.
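A toy version of per-event EHR tokenization might look like the following. The sinusoidal time features and one-hot variable slot are illustrative stand-ins for the learnable encoders described above:

```python
import math

def encode_event(timestamp, variable, value, var_vocab, dim=8):
    """Map one (timestamp, variable, value) EHR event to a token vector.

    The raw timestamp is encoded with sinusoidal features at several
    frequencies (no binning), the variable identity gets a one-hot slot,
    and the scalar measurement is appended directly. A trained model
    would replace each component with a learnable encoder.
    """
    time_feats = [math.sin(timestamp / 10 ** (2 * k / dim)) for k in range(dim)]
    var_feats = [1.0 if v == variable else 0.0 for v in var_vocab]
    return time_feats + var_feats + [value]

vocab = ["heart_rate", "glucose", "sbp"]
token = encode_event(timestamp=37.5, variable="glucose", value=110.0,
                     var_vocab=vocab)
```

Because the timestamp enters as a continuous feature, two events 3 minutes apart and two events 3 days apart produce distinguishable tokens without any imposed binning grid.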

3. Mathematical Formulations and Optimization Kernels

Dynamic tokenization strategies are underpinned by probabilistic or neural mechanisms for boundary detection, merging, masking, or selection. Canonical mathematical elements include:

  • Boundary prediction (Owodunni et al., 17 Jul 2025): boundary probabilities are emitted by a small MLP over byte representations,

$$p_t = \sigma\big(W_2 \,\mathrm{ReLU}(W_1 h_t + b_1) + b_2\big)$$

with the discrete boundary $b_t$ obtained via a Hard Gumbel-Sigmoid relaxation.
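A forward-pass sketch of this boundary predictor, with toy weights and a hard Gumbel-Sigmoid sample; a real model pairs the hard sample with a straight-through gradient estimator, which is omitted here:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gumbel_sigmoid_sample(logit, tau=1.0, rng=random):
    """Hard Gumbel-Sigmoid: add Logistic(0,1) noise to the logit,
    temperature-scale, and threshold the soft probability at 0.5."""
    u = rng.random()
    noise = math.log(u) - math.log(1.0 - u)  # logistic noise
    soft = sigmoid((logit + noise) / tau)
    return 1 if soft > 0.5 else 0

def boundary_logit(h, w1, b1, w2, b2):
    """Two-layer MLP over a byte embedding h: w2 . ReLU(w1 h + b1) + b2."""
    hidden = [max(0.0, sum(wi * hi for wi, hi in zip(row, h)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * v for w, v in zip(w2, hidden)) + b2
```

At positions where the logit is strongly positive the sample is a boundary with near-certainty; near-zero logits make the decision stochastic, which is what lets the compression rate adapt during training.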

  • Adaptive masking (Yan et al., 2024): during training, a mask length is sampled uniformly and applied to the latent token sequence before decoding,

$$\ell \sim U(\{M_{\min},\dots,M_{\max}\}), \qquad m[i] = \mathbf{1}_{i \leq \ell}$$

$$z = E(x), \quad z_m = z \odot m, \quad \hat{x} = D(z_m)$$

  • Token merging (Li et al., 17 Nov 2025, Jin et al., 2023): candidate tokens are scored pairwise,

$$s_{i,j} = \phi(h_i)^\top \phi(h_j)$$

where high-scoring pairs are merged, source indices are tracked, and the merging operation is kept differentiable.

  • Hybrid token selection functions (Park et al., 2020): Morphology-aware pipeline formalized as:

$$f_{\mathrm{hybrid}}(S) = \mathrm{BPE}(f_{\mathrm{morph}}(S))$$

ensuring no subword crosses a morpheme boundary.
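This composition can be illustrated with toy segmenters; the lexicon and merge list below are hypothetical, standing in for a trained morphological analyzer and a learned BPE model:

```python
def morph_segment(sentence, lexicon):
    """Toy morphological segmenter: split on spaces, then split each word
    at known morpheme boundaries from a lexicon."""
    out = []
    for word in sentence.split():
        out.extend(lexicon.get(word, [word]))
    return out

def bpe_within(morphemes, merges):
    """Apply BPE merges independently inside each morpheme, so no
    subword ever crosses a morpheme boundary."""
    tokens = []
    for m in morphemes:
        parts = list(m)
        for a, b in merges:
            out, i = [], 0
            while i < len(parts):
                if i + 1 < len(parts) and parts[i] == a and parts[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(parts[i])
                    i += 1
            parts = out
        tokens.extend(parts)
    return tokens

lexicon = {"cats": ["cat", "s"]}      # hypothetical morpheme lexicon
merges = [("c", "a"), ("ca", "t")]    # hypothetical learned BPE merges
hybrid = bpe_within(morph_segment("cats", lexicon), merges)
```

Running BPE per morpheme, rather than over the raw string, is what enforces the boundary constraint: a merge like ("t", "s") could never fire across the "cat"/"s" split.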

  • Loss functions:

    – Hinge-style boundary loss (Owodunni et al., 17 Jul 2025):

$$L_{\mathrm{boundary}} = \max(k - B \cdot N,\, 0), \qquad B = \alpha - A \cdot \sigma$$

    – Reconstruction losses over masked or merged tokenizations (Yan et al., 2024, Li et al., 17 Nov 2025, Jin et al., 2023):

$$\mathcal{L}_{\mathrm{MTR}} = -\frac{1}{N} \sum_{i=1}^N \log P(\hat{X}_i = x_i)$$

    used both for autoencoding and for masked token modeling.
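As a worked example of the hinge-style boundary loss, interpreting k as the realized boundary count and N as the sequence length (an interpretation inferred from the formula's shape, with illustrative numbers):

```python
def boundary_loss(k, n, alpha, A, sigma):
    """L = max(k - B*n, 0) with B = alpha - A*sigma: a one-sided hinge
    that activates only when the boundary count k exceeds the
    slack-adjusted target B*n (A standard deviations below alpha)."""
    B = alpha - A * sigma
    return max(k - B * n, 0.0)

# Illustrative numbers (not from the paper): target rate 0.25 boundaries
# per byte, one standard deviation (0.05) of slack, a 100-byte sequence.
loss_over = boundary_loss(k=30, n=100, alpha=0.25, A=1.0, sigma=0.05)   # 30 - 20 = 10
loss_under = boundary_loss(k=15, n=100, alpha=0.25, A=1.0, sigma=0.05)  # clipped to 0
```

The one-sided hinge is what distinguishes this from the rigid binomial losses mentioned in Section 2: counts anywhere inside the slack region incur zero penalty, so the compression rate is free to adapt within it.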

4. Empirical Evidence and Performance Impact

Dynamic tokenization strategies have demonstrated consistent and sometimes dramatic improvements in efficiency and performance across benchmarks and modalities.

| Method | Domain | Token Count Reduction | Performance Impact | Citation |
|---|---|---|---|---|
| FLEXITOKENS | Multilingual LM | 20–50% vs BPE | Up to +10% downstream F1 | (Owodunni et al., 17 Jul 2025) |
| ElasticTok | Image/Video | 3.5–5× vs fixed | No drop/increased accuracy | (Yan et al., 2024) |
| Retrofitting | Multilingual LM | 20–40% | <2pp drop in accuracy | (Feher et al., 2024) |
| MergeDNA | Genomics | ~4× compression | +2–8pp on DNA/protein | (Li et al., 17 Nov 2025) |
| MSViT | Vision | 5–30% | Equal/better top-1 acc | (Havtorn et al., 2023) |
| LaVIT | Vision-Language | 64% attention cost | +1–3pp on VL retrieval | (Jin et al., 2023) |
| Korean Hybrid | NLP (Korean) | OOV↓, BLEU↑ | +1–2 BLEU on MT, +1–3 F1 | (Park et al., 2020) |

Empirical ablations indicate that dynamic schemes both reduce over-fragmentation (token redundancies) and allow models to allocate capacity in high-complexity regions (e.g., image edges, genomic motifs), directly boosting accuracy and efficiency. Batch-level adaptive merging, even without model retraining, yields ∼1.7× attention speedup with negligible (<2%) loss on typical downstream tasks (Feher et al., 2024).
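As a back-of-the-envelope consistency check: if attention cost scales quadratically in sequence length, a token reduction of roughly 23% (within the 20–40% range reported above) yields approximately the observed 1.7× speedup:

```python
def attention_speedup(token_reduction):
    """Approximate attention speedup from shrinking the sequence by a
    fraction `token_reduction`, assuming cost ~ length^2 (i.e. that
    attention dominates compute)."""
    keep = 1.0 - token_reduction
    return 1.0 / (keep * keep)

s = attention_speedup(0.23)  # ~1.69x
```

Real speedups depend on how much of total compute attention actually accounts for, so this quadratic estimate is an upper-bound heuristic rather than a measurement.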

5. Limitations, Constraints, and Systemic Challenges

  • Vocabulary expansion and embedding assignment: Dynamic merging can introduce unbounded new token types; on-the-fly embedding generation via trained hypernetworks is required, but may slightly underperform original static embeddings (Feher et al., 2024).
  • Quadratic cost for fine-grained atomic splitting: In dynamic symbolic reasoning, enforcing full atomic granularity can inflate sequence lengths and attention cost unless carefully restricted (Zhang et al., 20 May 2025).
  • Inference–generation mismatch: Fully dynamic batch-level tokenization complicates autoregressive generation, sometimes requiring fallback to static generation or approximate nearest neighbor vocabulary searches (Feher et al., 2024).
  • Control of compression and fragmentation: Boundary rate parameters and losses require calibration, especially in multilingual or out-of-distribution adaptation; overly aggressive compression can degrade representation quality (Owodunni et al., 17 Jul 2025).
  • Integration with backbone architectures: Tokenizer design must align with positional embeddings, attention mechanisms, and domain-specific decoders to preserve invertibility and output fidelity.

6. Domain-Specific Extensions and Best Practices

  • Task-conditional selection: Select tokenization pipelines (e.g., pure BPE, morphological hybrid) depending on task (span vs. classification), language morphology, and data resource availability (Park et al., 2020).
  • Adaptive boundary prediction: Use learnable predictors with hinge-style losses; calibrate per domain/language for OOD generalization (Owodunni et al., 17 Jul 2025).
  • Differentiable token merging for sequences: In highly variable or motif-rich modalities (DNA, speech), stack local merging blocks to yield flexible ‘words’ (Li et al., 17 Nov 2025).
  • Dynamic masking for fine–coarse trade-offs: Randomized masking or merging during training induces a continuum of granularity, promoting robust performance across complexity regimes (Yan et al., 2024).
  • Structure- and time-aware patchifying: For time-evolving graphs or EHR, patchify streams by event or time windows, encode contextual and temporal structure into tokens (Biparva et al., 2024, Ma et al., 2024).
  • Fairness and multilinguality: Retrofitting LMs with batch-dynamic tokenization mitigates static-tokenizer biases, equalizing length and compute across morphologically diverse languages (Feher et al., 2024, Owodunni et al., 17 Jul 2025).

7. Theoretical Foundations and Future Directions

Dynamic tokenization is underpinned by an expressivity–efficiency tradeoff: finer tokens maximize atomistic expressiveness and task-aligned reasoning, but induce quadratic compute costs and challenge backbone architectures. Coarser or adaptive tokenization mitigates these costs yet risks information-hiding or symbolic bottlenecks, especially in reasoning or mathematical tasks (Zhang et al., 20 May 2025, Singh et al., 2024). Ongoing research integrates token-awareness metrics, per-task dynamic adaptation, and online learning of boundary predictors. Promising directions include joint optimization of tokenizer and backbone, embedding hypernetworks with enhanced generalization, dynamic tokenization in streaming and RL-agent settings, and variable-rate communication in distributed and federated training scenarios.

Dynamic tokenization strategies, by decoupling tokenization from static preprocessing and aligning it with domain structure, data complexity, and task requirements, offer a pathway to more adaptable, efficient, and generalizable neural models across modalities (Feher et al., 2024, Owodunni et al., 17 Jul 2025, Yan et al., 2024, Jin et al., 2023, Li et al., 17 Nov 2025, Havtorn et al., 2023, Ma et al., 2024, Biparva et al., 2024, Park et al., 2020, Zhang et al., 20 May 2025, Singh et al., 2024).
