
Lexicon-Guided Subword Decoding

Updated 2 February 2026
  • Lexicon-guided subword decoding is a method that constrains token generation using explicit linguistic, phonological, or morphological lexica to ensure valid subword sequences.
  • It integrates finite-state transducers, prefix tries, and beam search constraints to restrict decoding hypotheses to lexicon-admissible tokenizations.
  • Empirical outcomes show reduced word error rates in ASR and improved performance in morphologically complex, high-OOV language settings.

Lexicon-guided subword decoding is a paradigm in token-level sequence decoding that constrains subword token generation or segmentation by explicit reference to a lexicon or grammar, potentially integrated at multiple stages—in the tokenization algorithm, search graph, or decoder scoring function. This approach leverages linguistic, phonological, or morphological lexica to guide the decomposition and recomposition of words into subwords, ensuring that decoding hypotheses are restricted to linguistically, phonetically, or morphologically admissible subword sequences. Lexicon-guided methods are especially prominent in automatic speech recognition (ASR), language modeling, and sequence-to-sequence modeling for morphologically complex or OOV-rich languages, and are increasingly used in constrained text generation and summarization.

1. Foundations and Motivation

Subword modeling addresses fundamental limits in word- or character-level modeling, particularly the open vocabulary problem and the challenges posed by rare or morphologically complex wordforms. Conventional subword segmentation algorithms such as character n-grams, Byte-Pair Encoding (BPE), and WordPiece are typically optimized for frequency statistics in text corpora, and do not directly encode phonological or lexical structure—a gap that lexicon-guided decoding approaches explicitly fill.

In the context of speech recognition, using only frequent character sequence statistics can produce subwords that are linguistically ill-formed or phonetically incoherent, leading to segmentation errors that impact recognition accuracy. Lexicon-guided approaches introduce constraints from pronunciation lexica or morphological segmentation, ensuring that token boundaries and units are supported by expert knowledge, phonological alignment, or linguistic ground truth (Xu et al., 2018, Wang et al., 2020, 2207.13333).

2. Lexicon-Guided Subword Inventory Construction

The construction of lexicon-guided subword inventories typically begins from an external lexicon, which can be a pronunciation dictionary (mapping grapheme to phoneme sequences), a word list, or a morpheme/syllable inventory. Several strategies for inventory extraction include:

  • Pronunciation-assisted Sub-word Modeling (PASM): For each word–phoneme pair in a lexicon, letter–phoneme alignment (e.g., IBM Model 2 fast_align) is performed to find consistent letter–phoneme pairs. Only letter sequences L with sufficient frequency and phoneme alignment consistency are retained. Subword weights are proportional to aligned pair counts, yielding an inventory that reflects both text and phonological structure (Xu et al., 2018).
  • Syllable/Morpheme-based Lexica: Especially in agglutinative languages, manually curated lists of prefixes, infixes, and suffixes are organized by grammatical category and encoded as subword units. Such inventories may be built as context-marked morpheme units or syllables augmented with continuation markers for concatenation fidelity (2207.13333, Manohar et al., 2023).
  • Phone/BPE Lexiconization: Phone sequences derived from a lexicon are tokenized using BPE, so that resulting subword tokens correspond directly to phone subsequences, ensuring all subword units have valid phonetic realizations (Wang et al., 2020).
  • Finite-State Formalization: Tokenizers such as BPE or WordPiece can be formulated as FSTs, and composed with lexicon acceptors, ensuring that only tokenizations resulting in lexicon entries are considered valid (Cognetta et al., 2024).

The extraction and filtering procedures ensure subword vocabularies are compact, linguistically coherent, and of controllable granularity, enabling downstream models to leverage these units efficiently.
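As a concrete illustration, the frequency-and-consistency filtering step common to these strategies can be sketched as follows. This is a simplified sketch of a PASM-style extraction, not the published procedure: the alignment input format, the `min_count` threshold, and the normalization are illustrative assumptions.

```python
from collections import Counter

def build_inventory(aligned_pairs, min_count=2):
    """Sketch of PASM-style inventory extraction.

    `aligned_pairs` is an iterable of (letter_sequence, phoneme_sequence)
    pairs, as would be produced by a letter-phoneme aligner (e.g., fast_align).
    Letter sequences whose alignment recurs at least `min_count` times are
    kept as subword units, with weights proportional to aligned-pair counts.
    """
    counts = Counter(letters for letters, phones in aligned_pairs)
    total = sum(c for c in counts.values() if c >= min_count)
    return {letters: c / total for letters, c in counts.items() if c >= min_count}

# Toy alignments: ("tion" <-> "SH AH N") recurs, so "tion" becomes a unit,
# while one-off alignments are filtered out.
pairs = [("tion", "SH AH N"), ("tion", "SH AH N"), ("na", "N AH"), ("x", "K S")]
inventory = build_inventory(pairs, min_count=2)
print(inventory)  # {'tion': 1.0}
```

In a real system the filter would also check phoneme-alignment consistency, not just raw frequency; the count threshold controls inventory granularity.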

3. Formal Integration into Decoding Architectures

Lexicon-guided subword decoding is realized within both WFST-based and neural (sequence-to-sequence or encoder–decoder) frameworks:

  • WFST-based Decoding: Subword lexica and grammars (e.g., PASM, syllable inventories, morphological grammars) are encoded as weighted finite-state transducers, which are composed with standard acoustic, context, and language models:

T = \text{Min}(H \circ C \circ L \circ S \circ G)

where S is the subword-grammar WFST, G is the (subword) language model, and composition is under the tropical/log semiring. The search for the best decoding path is performed by Viterbi or beam search with hypothesis pruning. Exceptional words (OOVs) are handled by fallback “universal” WFSTs emitting single characters or contextually marked subwords with penalties (2207.13333, Manohar et al., 2023).

  • Lexicon-Guided Beam Search in Neural Decoders: For encoder–decoder or attention-based models, beam search expansion is constrained by an in-memory prefix tree (trie) or FST representing lexicon-admissible subword sequences. Only subword expansions yielding admissible lexicon paths are allowed in each search step, and word boundaries or completed words trigger extra steps such as LM score combination, segment recombination, or joint verification via parallel decoders (Wang et al., 2020, Tian et al., 26 Jan 2026, Wan et al., 2020).
  • Tokenization FSTs for Pattern-Constrained Generation: The tokenizer (e.g., BPE/WordPiece) is itself encoded as an FST, and composed with a wordlist or pattern acceptor, producing a transducer whose output is limited to subword sequences that spell words in the lexicon using the canonical tokenization (Cognetta et al., 2024).
  • Lattice and Pointer-Based Decoding: Lattice-aware encoders and pointer-generator decoders can copy either individual characters or lexicon entries at each step; candidate expansions are gathered dynamically from lexicon-augmented data structures. Scoring and expansion ensure only valid lexicon units or their OOV fallbacks are emitted, with training objectives that marginalize over all possible segmentations for alignment with reference strings (Wan et al., 2020).
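The trie-constrained expansion step described above can be sketched in a few lines. The function names and interface here are illustrative, not taken from any cited system: a character-level prefix trie over the lexicon, and a filter that keeps only subword candidates whose characters extend a valid lexicon path.

```python
def build_trie(lexicon):
    """Character-level prefix trie; the '$' key marks a complete word."""
    root = {}
    for word in lexicon:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def admissible(trie, prefix, candidates):
    """Keep only subword candidates that extend `prefix` along a lexicon path."""
    allowed = []
    for token in candidates:
        node, ok = trie, True
        for ch in prefix + token:
            if ch not in node:
                ok = False
                break
            node = node[ch]
        if ok:
            allowed.append(token)
    return allowed

trie = build_trie(["speak", "speech", "spell"])
# After emitting "spe", only expansions that stay inside the lexicon survive.
print(admissible(trie, "spe", ["ak", "ech", "ll", "xx"]))  # ['ak', 'ech', 'll']
```

In a full beam search each hypothesis would carry its trie-node pointer forward instead of re-walking the prefix, and reaching a '$' node would trigger the word-boundary steps (LM score combination, recombination) described above.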

4. Empirical Outcomes and Comparative Effectiveness

Quantitative gains from lexicon-guided decoding are consistently demonstrated in ASR and related tasks:

System / Metric                       | Baseline     | BPE          | Lexicon-Guided (PASM, Syllable, etc.)
WSJ dev93 WER                         | 20.7%        | 19.5%        | 18.5% (PASM) (Xu et al., 2018)
WSJ eval92 WER                        | 15.2%        | 15.6%        | 14.3% (PASM)
LibriSpeech dev-/test-clean WER       | 23.8%/23.2%  | 29.5%/29.5%  | 21.4%/21.3% (PASM)
SWBD WER                              | 9.2%         | 7.0%         | 6.8% (phone-BPE) (Wang et al., 2020)
Tamil WER                             | 24.7%        | –            | 12.31% (SG-WFST+U-WFST) (2207.13333)
Malayalam T3 (high OOV) WER           | 47.2%        | –            | 43.9% (syllable subword) (Manohar et al., 2023)
Lip-Siri WER                          | –            | –            | 36.87% (lexicon-guided decoding) (Tian et al., 26 Jan 2026)

Key observations:

  • PASM and phone-guided BPE reduce WER by 1–2 points over both character and BPE-only baselines in English ASR (Xu et al., 2018, Wang et al., 2020). Gains are larger in clean or morphologically complex domains.
  • For agglutinative languages (Tamil, Kannada), subword grammar guided beam search yields absolute WER reduction of 12–14 points and dramatically reduces out-of-vocabulary rates (2207.13333).
  • Syllable-guided lexicons enable up to 10% absolute WER reduction in open-vocabulary ASR for Malayalam, with 3–4× reduction in lexicon size and 2–3× reduction in decoding graph size, especially in high-OOV settings (Manohar et al., 2023).
  • In silent speech and visionless SSI, lexicon-guided constraints cut insertion/deletion errors and reduce WER by 8–10% relative to unconstrained decoding (Tian et al., 26 Jan 2026).
  • Lexicon-guided decoding preserves segmentation fidelity, avoids linguistically ill-formed splits, and reduces the learning burden for context-dependent pronunciations and morpheme combinations.

5. Formalisms: FST and Trie-Based Constraints

The integration of lexicon and tokenization constraints is efficiently expressible within FST (finite-state transducer) and trie frameworks:

  • Finite-State Transduction: Subword tokenizers (e.g., BPE, WordPiece) are constructed as FSTs mapping character inputs to subword outputs. A lexicon is encoded as a character-level finite-state acceptor (trie). Composition yields a transducer that only admits tokenizations corresponding to lexicon entries. Decoding proceeds via shortest-path search or beam search on the composed machine, assigning language model costs to output transitions (Cognetta et al., 2024).
  • Prefix Tries: Lexicon-guided beam search tracks partial hypotheses and their position in the trie. Only subword proposals that extend a valid lexicon path are allowed. Completed words trigger resets or emission in the complete word history. This provides both structural constraint and efficient pruning during search (Wang et al., 2020, Tian et al., 26 Jan 2026).
  • WFST Composition: Morphological or subword grammars are encoded as unweighted or weighted FSTs, composed with acoustic and language models. These automata enforce subword sequencing, boundary, and context constraints in the decoding graph (2207.13333, Manohar et al., 2023).
  • Lattice-Aware Encoders: In copy-network and multi-granularity summarization models, the lexicon is indexed for each input position, and decoder expansions are dynamically constrained by valid lexicon candidates or fallback to per-character units (Wan et al., 2020).
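The effect of composing a tokenizer FST with a lexicon acceptor can be emulated by brute-force enumeration, which is useful for building intuition. The sketch below is a stand-in for the FST composition, not the published construction: it enumerates all tokenizations of a word over a given subword vocabulary, admitting them only when the spelled-out word is a lexicon entry.

```python
def lexicon_tokenizations(word, vocab, lexicon):
    """Enumerate subword tokenizations of `word` over `vocab`, admitted
    only if `word` is in `lexicon` -- a brute-force stand-in for composing
    a tokenizer FST with a character-level lexicon acceptor."""
    if word not in lexicon:
        return []
    results = []

    def extend(pos, toks):
        if pos == len(word):
            results.append(tuple(toks))
            return
        for end in range(pos + 1, len(word) + 1):
            piece = word[pos:end]
            if piece in vocab:
                extend(end, toks + [piece])

    extend(0, [])
    return results

vocab = {"un", "lock", "lo", "ck", "unlock"}
print(lexicon_tokenizations("unlock", vocab, {"unlock"}))
```

The real FST formulation additionally restricts the output to the tokenizer's canonical tokenization and scales to large lexica via composition and shortest-path search rather than enumeration.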

6. Applications and Generalizations

Lexicon-guided subword decoding has been deployed across:

  • End-to-End ASR: Integration of phonological lexica and subword grammars yields robust open-vocabulary ASR, particularly for morphologically rich or agglutinative languages (Tamil, Kannada, Malayalam) where word-form combinatorics and OOV rates are high (Xu et al., 2018, Manohar et al., 2023, 2207.13333).
  • Silent Speech Interfaces: Lip-Siri constrains decoder hypotheses to produce only lexicon-admissible word sequences via subword tokenization and a lexicon-guided beam search, demonstrating significant WER improvements in contactless open-vocabulary decoding (Tian et al., 26 Jan 2026).
  • Text Generation and Summarization: In transformer-based abstractive summarization, lexicon-constrained copying enables the selective generation or copying of whole lexicon entries, facilitating multi-granularity modeling and supporting languages without explicit word delimiters (Wan et al., 2020).
  • Guided Pattern-Constrained Generation: Finite-state transduction frameworks enable LLMs to simultaneously satisfy subword tokenization constraints and token-level pattern constraints, essential for domains such as guided generation, redaction, or controlled text manipulation (Cognetta et al., 2024).

Limitations emerge when lexica are incomplete, improperly curated, or of inappropriate granularity (too fine or too coarse), potentially impeding OOV generalization or introducing LM sparsity in low-data subword LMs (Manohar et al., 2023). In fully in-vocabulary scenarios, word-level modeling may still outperform subword decoding, owing to the loss of long-range context under subword decomposition.

7. Representative Algorithms and Implementation Practice

Canonical pseudocode for lexicon-guided beam search or subword segmentation recurs in multiple domains:

  • Word Segmentation via Lexicon (PASM):
    • For each word, segment into subword units maximizing \sum_i \log w(L_i), using subword weights from alignment frequency. Viterbi or greedy longest-match decoding is employed (Xu et al., 2018).
  • Trie-Constrained Beam Search:
    • Beam hypotheses at step t are extended only by subword tokens reachable in the trie, advancing state pointers and applying the lexicon constraint or penalty as appropriate (Tian et al., 26 Jan 2026, Wang et al., 2020).
  • FST Composition Search:
    • Decoding graph is the composition of acoustic, lexicon, subword grammar, and language models; Viterbi/beam search returns the path of minimum total cost, corresponding to a valid sequence of subword tokens spelling out lexicon words (Cognetta et al., 2024, 2207.13333).
  • Copy Network with Lexicon Integration:
    • Lattice-aware encoder with pointer-generator decoder dynamically proposes copy expansions from lexicon entries at each source position, reconciled during beam search, with training objectives marginalizing over all valid segmentations (Wan et al., 2020).
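The first of these, Viterbi segmentation maximizing \sum_i \log w(L_i), reduces to a short dynamic program. The weight table below is a toy example, not values from any paper:

```python
import math

def viterbi_segment(word, weights):
    """Segment `word` into subwords maximizing sum_i log w(L_i),
    where `weights` maps each inventory subword to its weight."""
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)  # (score, backpointer) per position
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in weights and best[start][0] > -math.inf:
                score = best[start][0] + math.log(weights[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no segmentation exists under this inventory
    # Trace back the best path.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

weights = {"na": 0.3, "tion": 0.5, "nation": 0.1, "t": 0.05, "ion": 0.05}
print(viterbi_segment("nation", weights))  # ['na', 'tion']
```

Here "na" + "tion" (log 0.15) beats both the whole-word unit (log 0.1) and finer splits, matching the intuition that high-weight, phonologically aligned units are preferred.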

Empirical practice emphasizes:

  • Filtering subword units by data-driven frequency and lexicon-consistency thresholds.
  • Fallback mechanisms for OOVs via unigram penalties or single-character emissions.
  • Explicit boundary and context markers to enable deterministic recombination of subword sequences after decoding.
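The last point, deterministic recombination via markers, can be sketched as follows. The trailing "@@" continuation marker is one common convention (as in BPE-style tooling); the specific marker and its placement vary between systems:

```python
def recombine(subwords, marker="@@"):
    """Rejoin a subword sequence in which a trailing continuation
    marker signals that the next unit belongs to the same word."""
    words, current = [], ""
    for token in subwords:
        if token.endswith(marker):
            current += token[: -len(marker)]  # strip marker, keep accumulating
        else:
            words.append(current + token)     # word boundary reached
            current = ""
    return " ".join(words)

print(recombine(["lex@@", "icon", "guided"]))  # lexicon guided
```

Because the marker makes word boundaries explicit in the token stream itself, recombination is a deterministic pass and needs no further lexicon lookup.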

References

  • (Xu et al., 2018) Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling
  • (2207.13333) Knowledge-driven Subword Grammar Modeling for Automatic Speech Recognition in Tamil and Kannada
  • (Manohar et al., 2023) Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam
  • (Wang et al., 2020) An investigation of phone-based subword units for end-to-end speech recognition
  • (Cognetta et al., 2024) Tokenization as Finite-State Transduction
  • (Wan et al., 2020) Lexicon-constrained Copying Network for Chinese Abstractive Summarization
  • (Tian et al., 26 Jan 2026) Lip-Siri: Contactless Open-Sentence Silent Speech with Wi-Fi Backscatter
