
Unigram Language Models

Updated 30 January 2026
  • Unigram language models are probabilistic models that assign tokens independent probabilities based solely on corpus frequency counts.
  • They form a fundamental baseline in NLP, underpinning methods for subword tokenization and iterative vocabulary pruning through EM optimization.
  • Recent innovations neuralize these models by integrating empirical frequency priors into neural decoders, enhancing convergence and downstream performance.

A unigram language model is a probabilistic model over a vocabulary of words or subword units in which the tokens are assumed to be drawn independently and identically from a single multinomial distribution. This non-contextual modeling paradigm is foundational both as a baseline in language modeling and as a building block for segmental and subword algorithms in modern natural language processing. Despite its simplicity, substantial methodological innovation has occurred in estimation strategies, smoothing, and neuralization, as well as in its integration into tokenization methods and as a structural prior within neural architectures.

1. Theoretical Foundations

Let \mathcal{V} denote a vocabulary (either fixed or, in some formulations, open-ended) and let N be the corpus token count. The core model specifies the probability p(w) of any token w \in \mathcal{V} as context-independent. For a string x_1, x_2, \ldots, x_N, the probability assigned under the unigram model is:

P(x_1, \ldots, x_N) = \prod_{i=1}^N p(x_i)

Maximum likelihood estimation (MLE) from a corpus C proceeds via frequency counts:

p_{\mathrm{MLE}}(w) = \frac{\mathrm{Count}(w)}{N}

This plug-in estimator is standard in both classical and neural NLP pipelines, e.g., as the default in HITgram and for the empirical distribution priors in neural generators (Dasgupta et al., 2024, Meister et al., 2022).
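As a minimal illustration (not the implementation of any of the cited systems), the plug-in MLE estimator can be sketched as:

```python
from collections import Counter

def mle_unigram(tokens):
    """Plug-in MLE estimate: p(w) = Count(w) / N."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

corpus = "the cat sat on the mat".split()
p = mle_unigram(corpus)  # p["the"] is 2/6
```

Note that any word absent from the corpus receives no entry at all, which is precisely the zero-probability OOV problem discussed below.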

However, the empirical unigram faces two severe limitations: (1) it assigns zero probability to out-of-vocabulary (OOV) tokens, and (2) its sample bias inflates estimates of frequent items while under-representing rare or unobserved types, with these biases scaling with corpus size (Nikkarinen et al., 2021).

2. Unigram LM for Subword Tokenization

The unigram LM tokenization method, as implemented in SentencePiece, is a probabilistic generative model over segmentations of sentences into subword units (Bostrom et al., 2020). Let V be a candidate subword vocabulary and \theta = \{p(u) : u \in V\}, with \sum_{u \in V} p(u) = 1. For a string x_t, all possible segmentations s \in \mathrm{Seg}(x_t) (ways to cover x_t with tokens from V) are enumerated:

P(s \mid \theta) = \prod_{u \in s} p(u)

The marginal likelihood for x_t is:

P(x_t \mid \theta) = \sum_{s \in \mathrm{Seg}(x_t)} \prod_{u \in s} p(u)

An expectation-maximization (EM) procedure is used for optimization:

  • E-step: Compute posteriors q_t(s) = \frac{P(s \mid \theta^{(\mathrm{old})})}{P(x_t \mid \theta^{(\mathrm{old})})} over the segmentations of each x_t.
  • M-step: Update p^{\mathrm{new}}(u) = \frac{C(u)}{\sum_{v \in V} C(v)}, where C(u) is the expected total count of u under the posteriors.
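The two steps above can be sketched for short strings by enumerating segmentations exhaustively (real implementations use dynamic programming instead; this brute-force version is for illustration only):

```python
import math
from collections import defaultdict

def segmentations(x, vocab):
    """Enumerate all ways to cover string x with subwords from vocab."""
    if not x:
        yield []
        return
    for i in range(1, len(x) + 1):
        if x[:i] in vocab:
            for rest in segmentations(x[i:], vocab):
                yield [x[:i]] + rest

def em_step(corpus, theta):
    """One EM iteration: posterior-weighted expected counts C(u), then renormalize."""
    expected = defaultdict(float)
    for x in corpus:
        segs = list(segmentations(x, theta))
        joint = [math.prod(theta[u] for u in s) for s in segs]  # P(s | theta)
        marginal = sum(joint)                                   # P(x | theta)
        for s, p_s in zip(segs, joint):
            for u in s:
                expected[u] += p_s / marginal  # posterior q(s) credits each use of u
    total = sum(expected.values())
    return {u: expected[u] / total for u in theta}

theta = {"a": 0.25, "b": 0.25, "ab": 0.5}
theta = em_step(["ab"], theta)  # mass shifts toward the single-token segmentation
```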

To fix vocabulary size k, an initial oversized set is pruned iteratively, measuring each candidate's removal impact on corpus perplexity and removing the least useful (highest-loss) tokens until |V| = k. This global, iterative pruning, driven by a likelihood objective, contrasts sharply with greedy BPE (Bostrom et al., 2020).

At inference, the optimal segmentation s^* is found via Viterbi-style dynamic programming:

s^{*} = \underset{s \in \mathrm{Seg}(x)}{\mathrm{argmax}} \ \sum_{u \in s} \log p(u)

This ensures segmentations align with both maximum likelihood and final vocabulary.
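A compact sketch of the Viterbi-style search (a standard lattice DP over prefix positions, not SentencePiece's actual code):

```python
import math

def viterbi_segment(x, p):
    """Best segmentation s* = argmax over s of sum log p(u), via DP over prefixes.

    best[j] holds (best log-prob of x[:j], backpointer to the split position)."""
    n = len(x)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for i in range(j):
            u = x[i:j]
            if u in p and best[i][0] > -math.inf:
                score = best[i][0] + math.log(p[u])
                if score > best[j][0]:
                    best[j] = (score, i)
    # Backtrack from the end of the string.
    segs, j = [], n
    while j > 0:
        i = best[j][1]
        segs.append(x[i:j])
        j = i
    return segs[::-1]
```

For example, with p = {"a": 0.4, "b": 0.4, "ab": 0.3}, the single token "ab" (log 0.3) beats "a"+"b" (log 0.16), so viterbi_segment("ab", p) returns ["ab"].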

3. Advanced Smoothing and Priors

Smoothing is critical for robust unigram modeling. HITgram implements additive (Laplace/add-k) smoothing so that even unseen words receive nonzero probabilities:

P_{\mathrm{Lap}}(w) = \frac{\mathrm{Count}(w) + \alpha}{N + \alpha |\mathcal{V}|}
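The additive estimator above can be sketched as follows (a generic illustration, not HITgram's actual code):

```python
from collections import Counter

def laplace_unigram(tokens, vocab, alpha=1.0):
    """Additive smoothing: P(w) = (Count(w) + alpha) / (N + alpha * |V|).

    Every word in vocab gets nonzero probability, including unseen ones."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

p = laplace_unigram(["a", "a", "b"], vocab={"a", "b", "c"}, alpha=1.0)
# unseen "c" now gets probability 1/6 instead of zero
```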

Choosing 0 < \alpha < 1 gives finer-grained add-k smoothing. More sophisticated approaches, such as adaptor-generator models, interpolate the empirical 1-gram with a character-level generator. Goldwater et al.'s construction (Nikkarinen et al., 2021) combines a Pitman–Yor process (PYP) smoothing prior over clusters (adaptors) with a generative character LM for OOV handling:

p(w_n) = \underbrace{\frac{c_{w} - a\, n_w}{N + b}}_{\text{smoothed 1-gram}} + \underbrace{\frac{aK + b}{N + b}\, p_{\mathrm{gen}}(w)}_{\text{backoff for OOVs}}

This framework ensures power-law type/token distributions and addresses the empirical estimator's over-/under-confidence.

4. Extensions and Neuralization

Neuralization of the unigram LM refers to replacing the character-level generator with modern RNNs (e.g., a 3-layer LSTM), yielding a fully differentiable model that alternates cluster assignment sampling (E-step: Gibbs for clusters) and generator updating (M-step: SGD on cross-entropy) (Nikkarinen et al., 2021). This two-stage model learns to interpolate between the token-level and type-level distributions automatically, as controlled by the PYP hyperparameters.

Empirically, such models consistently outperform both token- and type-LSTM baselines in held-out cross-entropy across diverse languages, accurately handle the frequency spectrum (stratified by rank), and generalize to OOV types via the generator component.

5. Integration in Neural Language Generation and Pretraining

Neural language generators (e.g., sequence-to-sequence Transformers) inherently drift towards learning unigram distributions in early training, prior to acquiring semantic or syntactic knowledge (Meister et al., 2022). Initializing the decoder's output bias b as \log p_{\mathrm{unigram}} (i.e., \log \frac{\mathrm{count}(w)}{N} for each w) encodes the empirical frequency prior directly in the softmax layer:

b_i = \log \left( \frac{\mathrm{count}(w_i)}{N} \right)

This initialization may accelerate convergence and enables the context encoder to specialize in nonfrequency aspects. Across machine translation benchmarks, initializations with this log-unigram prior yield an early BLEU AUC improvement of 2–5 points and small improvements in final BLEU scores (Meister et al., 2022). The effect persists across languages and under various data regimes.
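Computing such a bias vector is straightforward; a minimal sketch, with the caveat that the floor pseudo-count for unseen vocabulary items is an assumption here (the original work may handle zero counts differently):

```python
import math
from collections import Counter

def log_unigram_bias(corpus_tokens, vocab):
    """b_i = log(count(w_i) / N) for each vocabulary item w_i.

    Zero-count items get a hypothetical 0.5 pseudo-count to avoid log(0)."""
    counts = Counter(corpus_tokens)
    n = len(corpus_tokens)
    floor = 0.5  # assumption: small floor for unseen items
    return [math.log(max(counts[w], floor) / n) for w in vocab]

bias = log_unigram_bias(["a", "a", "b", "c"], vocab=["a", "b", "c", "d"])
```

The resulting list would then be copied into the decoder's output-layer bias before training begins.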

Ablations reveal that (1) the bias remains stable throughout training (KL divergence stays near zero), and (2) removing the bias at evaluation disentangles context from frequency, a property less evident in randomly-initialized models. A plausible implication is that neural decoders with unigram-initialized bias more effectively leverage model capacity for modeling token interdependencies.

6. Practical Implementations and Algorithmic Considerations

HITgram demonstrates that unigram language models can be efficiently implemented as hash maps for token count storage, with linear corpus scaling and incremental updates via merges (Dasgupta et al., 2024):

  • Maximum tokenization throughput: 50,000 tokens/sec on standard hardware.
  • Corpus management supports incremental learning, data replacement, and low-frequency pruning for memory conservation.
  • Smoothing and context-sensitive weighting (e.g., w(w) = \log(1+\mathrm{Count}(w))) are supported, the latter stabilizing the influence of highly frequent tokens.
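The hash-map design described above can be sketched as follows (class and method names are illustrative, not HITgram's actual API):

```python
import math
from collections import Counter

class UnigramStore:
    """Hash-map token-count store with incremental updates and pruning,
    in the spirit of the HITgram description (names here are hypothetical)."""

    def __init__(self):
        self.counts = Counter()

    def update(self, tokens):
        # Incremental learning: merge counts from a new batch of tokens.
        self.counts.update(tokens)

    def prune(self, min_count=2):
        # Low-frequency pruning for memory conservation.
        self.counts = Counter(
            {w: c for w, c in self.counts.items() if c >= min_count}
        )

    def weight(self, w):
        # Damped frequency weight: w(w) = log(1 + Count(w)).
        return math.log1p(self.counts[w])

store = UnigramStore()
store.update(["a", "a", "b"])
store.update(["a"])  # incremental merge: count of "a" is now 3
```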

For practical applications—autocomplete, word clouds, pedagogical demos—the model’s simplicity and efficiency are key strengths. Limitations include the incapacity to model token order and syntactic dependencies, as well as scaling considerations for rapidly-growing vocabularies.

7. Empirical Comparison and Impact

The unigram LM tokenization method yields subword vocabularies with superior morphological alignment relative to BPE. Empirical results (Bostrom et al., 2020) show, on 20k-vocabulary settings:

  • English (vs. gold-standard CELEX): Unigram LM F1 30.3%, BPE F1 19.3%.
  • Japanese (vs. MeCab): Unigram LM F1 77.2%, BPE F1 73.8%.

Pretrained masked language models using unigram LM segmentation achieve equal or superior downstream performance (e.g., SQuAD 1.1 EM, MNLI-m, Japanese TyDi QA) compared to otherwise-identical BPE-based models.

Illustrative segmentations confirm that the unigram LM more frequently preserves morphological stems and affixes, avoiding the "dead-zone" intermediate tokens typical of BPE.

  Metric / Task         BPE     Unigram LM
  English CELEX F1      19.3%   30.3%
  Japanese MeCab F1     73.8%   77.2%
  SQuAD 1.1 EM          80.6    81.8
  Japanese TyDi QA F1   42.1    54.4

A plausible implication is that unigram LM approaches not only offer theoretical advantages in vocabulary construction but also translate into tangible improvements in language-model pretraining and downstream efficacy.


In summary, unigram language models are foundational in both theoretical and applied NLP, with relevance extending from simple empirical estimation to advanced neural inference and subword modeling strategies. They offer mathematically precise estimation, direct interpretability, compatibility with EM and Bayesian smoothing, and robust performance within broader neural and non-neural systems (Bostrom et al., 2020, Dasgupta et al., 2024, Meister et al., 2022, Nikkarinen et al., 2021).
