N-gram HD Encoders & Transformer Fusion
- N-gram HD encoders are computational frameworks that represent n-gram statistics and contextual spans as fixed-length, high-dimensional vectors using hyperdimensional computing principles.
- They leverage binding and bundling operations to encode n-gram sequences, enabling efficient integration into both classic classifiers and modern Transformer-based models.
- Empirical results show these encoders offer favorable trade-offs: marked improvements in speed and memory usage at competitive accuracy across diverse NLP tasks.
N-gram HD (High-Dimensional) Encoders are computational frameworks for representing n-gram statistics and contextual span information from text as fixed-length, high-dimensional vectors. These approaches synthesize ideas from hyperdimensional computing (HDC) and neural encoding architectures to produce distributed, resource-efficient representations suitable for both classic classifiers and modern Transformer-based models. Two principal lines—hyperdimensional binding/bundling schemes (Alonso et al., 2020) and n-gram Transformer fusion architectures (Song et al., 2021)—define current methodologies.
1. Hyperdimensional Computing Principles for N-gram Encoding
Hyperdimensional (HD) or Vector-Symbolic Architectures encode symbols, sequences, and sets by associating each with a high-dimensional vector, typically with dimension $D$ on the order of thousands. Base vectors are drawn i.i.d. at random and are therefore nearly orthogonal to one another with high probability. This property supports distributed operations:
- Binding (element-wise multiplication, $\odot$): Used to encode sequence or order, crucial for n-grams.
- Bundling (element-wise addition, $+$): Aggregates multiple hypervectors (e.g., n-gram counts) in the same $D$-dimensional space.
For character n-gram encoding: each character is mapped to a random bipolar vector, permuted by a fixed operator to encode positional information, and bound with its neighbors to form an n-gram representation. These are aggregated across a text to form a single $D$-dimensional summary vector:

$$\mathbf{V} = \sum_{w} c_w \left( \mathbf{v}_{w_1} \odot \bigodot_{j=2}^{n} \rho^{j}\!\left(\mathbf{v}_{w_j}\right) \right),$$

where $c_w$ is the count of n-gram $w = w_1 \cdots w_n$ in the document, $\mathbf{v}_{w_j}$ is the base vector of symbol $w_j$, and $\rho$ denotes a fixed cyclic permutation of vector coordinates.
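The binding and bundling properties above can be checked numerically. A minimal sketch (the dimensionality $D = 10{,}000$ and the specific seed are illustrative assumptions, not values from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # assumed dimensionality; the effect holds for any large D

# Two random bipolar base vectors.
a, b = rng.choice([-1, 1], size=(2, D))

def cos(x, y):
    """Cosine similarity between two hypervectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Random bipolar vectors are nearly orthogonal: cosine similarity ~ 0.
near_zero = cos(a, b)

# Binding (element-wise multiplication) produces a vector
# dissimilar to both of its inputs -- suitable for encoding order.
bound = a * b
bound_vs_a = cos(bound, a)

# Bundling (element-wise addition) stays similar to each summand
# -- suitable for aggregating n-gram counts.
bundle = a + b
bundle_vs_a = cos(bundle, a)  # close to 1/sqrt(2)
```

With large $D$, the concentration is tight: the first two similarities are near zero while the bundled vector remains clearly similar to each input, which is exactly why counts of many n-grams can share one vector.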
2. Algorithmic Workflow and Pseudocode
A standard workflow for N-gram HD encoding (Alonso et al., 2020) consists of the following steps:
- Initialization: Assign each symbol $S$ in the alphabet $\Sigma$ a random bipolar base vector $\mathbf{v}_S \in \{\pm 1\}^D$.
- N-gram Extraction: Use a sliding window over the text to enumerate all overlapping character n-grams.
- Binding: For each n-gram $w$, bind the permuted symbol hypervectors as above.
- Bundling: Accumulate the bound n-gram hypervectors into the sum $\mathbf{V}$.
- Normalization: Obtain $\hat{\mathbf{V}} = \mathbf{V} / \|\mathbf{V}\|_2$.
- Classifier Input: The normalized HD vector $\hat{\mathbf{V}}$ is input to standard classifiers.
Pseudocode for the encoder (Alonso et al., 2020), where the text is denoted $T$ to avoid a clash with the dimension $D$:

```
initialize item memory: for each S ∈ Σ, v_S ← random bipolar vector in {±1}^D
V ← 0 ∈ ℝ^D
for each n-gram starting at position i of text T:
    h ← v_{T[i]}
    for j = 2 to n:
        h ← h ⊙ ρ^j(v_{T[i + j − 1]})   # bind permuted symbol vectors
    V ← V + h                            # bundle
normalize: V̂ ← V / ||V||₂
return V̂
```
The computational complexity is $O(nLD)$ for a text of length $L$: each of the $O(L)$ window positions performs $n-1$ bindings of cost $O(D)$.
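The workflow above can be sketched as a compact NumPy function. This is an illustrative implementation under assumptions (function name, default $D$, and seeding are not from the source):

```python
import numpy as np

def hd_encode(text, n=3, D=10_000, seed=0):
    """Sketch of a character n-gram HD encoder in the style of
    Alonso et al. (2020): bind permuted bipolar symbol vectors per
    n-gram, bundle across the text, then L2-normalize."""
    rng = np.random.default_rng(seed)
    # Item memory: one random bipolar base vector per distinct character.
    # sorted() makes the mapping deterministic for a given seed.
    item = {c: rng.choice([-1, 1], size=D) for c in sorted(set(text))}
    V = np.zeros(D)
    for i in range(len(text) - n + 1):       # sliding window
        h = item[text[i]].copy()
        for j in range(2, n + 1):
            # rho^j = cyclic shift by j positions encodes the slot
            # of each character within the n-gram.
            h = h * np.roll(item[text[i + j - 1]], j)
        V += h                                # bundle
    norm = np.linalg.norm(V)
    return V / norm if norm > 0 else V
```

The output is a fixed-length unit vector regardless of the text length or alphabet size, which is the property the next section quantifies.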
3. Trade-offs: Dimensionality, Context Size, and Resource Efficiency
A key advantage of N-gram HD encoders is the decoupling of output dimensionality from the combinatorial explosion of the n-gram vocabulary: a traditional n-gram model requires up to $|\Sigma|^n$ counters, whereas HD encoding always yields a $D$-dimensional vector. Selection of $D$ governs fidelity and resource consumption:
- Larger $D$: Higher fidelity to true n-gram histograms and increased classification accuracy, at increased memory and compute cost.
- Smaller $D$: Substantial memory and throughput savings, with potential loss in accuracy if under-parameterized.
Empirical results indicate that F$_1$ scores increase rapidly with $D$ at small dimensionalities, then saturate at a dataset-dependent value, with smaller corpora saturating at lower $D$ than larger ones (Alonso et al., 2020). Resource improvements scale accordingly: substantial train/test speedups and memory reductions over dense n-gram representations are common at minimal accuracy loss.
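As a rough illustration of this decoupling (the 26-letter alphabet is an assumption chosen for illustration, not a figure from the source):

```python
# Explicit n-gram histogram: one counter per possible n-gram,
# growing exponentially in n for a fixed alphabet.
alphabet_size = 26
explicit_counters = {n: alphabet_size ** n for n in (2, 3, 4)}
# n=2 -> 676, n=3 -> 17,576, n=4 -> 456,976 counters

# HD encoding: output size is a fixed D, independent of n.
D = 10_000
```

At $n=4$ the explicit histogram already needs ~45x more slots than a $D = 10{,}000$ hypervector, and the gap widens with every increment of $n$.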
4. Integration into Transformer Architectures: N-gram Fusion Techniques
In neural text encoders such as ZEN 2.0 (Song et al., 2021), n-gram high-dimensional embeddings are constructed via unsupervised statistical extraction and Transformer-based contextualization:
- N-gram Extraction: Spans of length $2$–$8$ are selected by pointwise mutual information (PMI) and frequency thresholds; the thresholds are tuned per language, yielding separate n-gram vocabularies for Chinese and Arabic.
- HD Embedding & Encoding: A learnable n-gram lookup table (hidden size $768$ or $1024$, matching the backbone), followed by a six-layer Transformer encoder for contextualization.
Fusion with character/token-level Transformer states is executed layer-wise. Let $\mathbf{v}_i^{(l)}$ denote the state of token $i$ at layer $l$ and $\mathbf{u}_k^{(l)}$ the contextual vector of the $k$-th n-gram at that layer; each token state is augmented by the sum of the vectors of the n-grams whose spans cover it:

$$\mathbf{v}_i^{(l)} \leftarrow \mathbf{v}_i^{(l)} + \sum_{k \,:\, i \in \mathrm{span}(k)} \mathbf{u}_k^{(l)};$$

no gating or concatenation is used. N-gram signals augment token-level representations directly, with the n-gram encoder operating as a parallel Transformer.
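The layer-wise fusion step can be sketched as follows. This shows only the additive combination at one layer; the function name, array shapes, and span format are hypothetical simplifications, not the source's API:

```python
import numpy as np

def fuse_ngram_states(token_states, ngram_states, spans):
    """Sketch of ZEN-style layer-wise fusion (simplified).

    token_states: (L, d) token hidden states at one layer
    ngram_states: (K, d) n-gram encoder states at the same layer
    spans: list of (start, end) token index ranges, one per n-gram

    Each token state receives the sum of the contextual vectors of
    all n-grams covering it; no gating or concatenation.
    """
    fused = token_states.copy()
    for k, (start, end) in enumerate(spans):
        fused[start:end] += ngram_states[k]  # broadcast add over the span
    return fused
```

Because the combination is a plain sum, tokens covered by several extracted n-grams accumulate all of their signals, while uncovered tokens pass through unchanged.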
5. Empirical Evaluation and Performance Metrics
HyperEmbed (Alonso et al., 2020) evaluated N-gram HD encoders on three small (Chatbot, AskUbuntu, WebApplication) and one large (20NewsGroups) corpus:
- Baselines: Conventional character n-gram counts (200,000 dimensions).
- HD Embeddings: n-gram orders up to $4$, with $D$ swept over a range of dimensionalities.
- Classifiers: Ridge, k-nearest neighbors (KNN), MLP, Passive-Aggressive (PA), Random Forest (RF), Linear SVC (LSVC), SGD, Nearest Centroid (NC), and Bernoulli Naive Bayes (BNB).
Key results:
- For AskUbuntu (MLP): F$_1$ = 0.91 vs. baseline 0.92, with substantially faster training and testing and a large memory reduction.
- For 20NewsGroups: most classifiers maintained 90% of baseline F$_1$ with large speedups and memory reductions.
Linear classifiers and shallow MLPs generally outperformed local or tree-based models, which lost accuracy due to the distributed representation's smoothing effects.
ZEN 2.0 (Song et al., 2021) reports consistent state-of-the-art results across a battery of Chinese and Arabic NLP tasks (e.g., MSR-CWS F$_1$ = 98.66, CMRC2018 F$_1$ = 89.92), outperforming prior results typically by 0.1–2.0 absolute points.
6. Practical Guidelines and Adaptation Considerations
Recommendations for practitioners (Alonso et al., 2020, Song et al., 2021):
- Select n-gram orders up to $4$ for small corpora and up to $3$ for large corpora.
- Sweep $D$ upward from a small value; select the minimal $D$ achieving $95$–$98\%$ of baseline accuracy.
- Prefer linear and shallow neural classifiers for HD vectors.
- For resource-constrained environments, binarize encodings and classifier weights.
- ZEN 2.0 architecture adapts to multiple languages (Chinese, Arabic) and domains via threshold tuning and separate n-gram vocabularies, without structural changes.
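The binarization recommendation can be sketched as follows (hypothetical helper names, not from the source): sign-binarize the real-valued HD vector and compare binarized vectors with a cheap Hamming-based similarity instead of cosine.

```python
import numpy as np

def binarize(v):
    """Sign-binarize an HD vector to {±1} (ties broken to +1),
    stored as int8 to cut memory to one byte per component."""
    return np.where(v >= 0, 1, -1).astype(np.int8)

def hamming_similarity(a, b):
    """Fraction of matching components between two binarized
    vectors; a cheap stand-in for cosine similarity."""
    return float(np.mean(a == b))
```

Binarization trades a small amount of fidelity for an 8x memory reduction over float64 storage and integer-only comparison arithmetic, in line with the resource-constrained deployment advice above.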
7. Significance and Applications
N-gram HD encoders address scaling and efficiency challenges in embedding n-gram statistics for NLP tasks. By leveraging high dimensionality and distributed encoding, they provide compact, accurate representations with dramatic resource savings. They integrate into both classic ML pipelines and Transformer-based neural architectures, enabling robust, end-to-end modeling of contextual spans. The framework offers practical trade-offs between memory, speed, and accuracy, with strong results in multilingual, multi-domain settings verified experimentally on several benchmark corpora (Alonso et al., 2020; Song et al., 2021).