
N-gram HD Encoders & Transformer Fusion

Updated 30 December 2025
  • N-gram HD encoders are computational frameworks that represent n-gram statistics and contextual spans as fixed-length, high-dimensional vectors using hyperdimensional computing principles.
  • They leverage binding and bundling operations to encode n-gram sequences, enabling efficient integration into both classic classifiers and modern Transformer-based models.
  • Empirical results show these encoders offer significant trade-offs with marked improvements in speed and memory usage while maintaining competitive accuracy across diverse NLP tasks.

N-gram HD (High-Dimensional) Encoders are computational frameworks for representing n-gram statistics and contextual span information from text as fixed-length, high-dimensional vectors. These approaches synthesize ideas from hyperdimensional computing (HDC) and neural encoding architectures to produce distributed, resource-efficient representations suitable for both classic classifiers and modern Transformer-based models. Two principal lines define current methodologies: hyperdimensional binding/bundling schemes (Alonso et al., 2020) and n-gram Transformer fusion architectures (Song et al., 2021).

1. Hyperdimensional Computing Principles for N-gram Encoding

Hyperdimensional (HD) or Vector-Symbolic Architectures encode symbols, sequences, and sets by associating each with a high-dimensional vector, typically of dimension D ∈ [10^3, 10^5]. Each base vector v_S ∈ {±1}^D is drawn i.i.d., so any two are nearly orthogonal with high probability. This property supports distributed operations:

  • Binding (element-wise multiplication, ⊙): encodes sequence and order, crucial for n-grams.
  • Bundling (element-wise addition, +): aggregates multiple hypervectors (e.g., n-gram counts) in the same D-dimensional space.

For character n-gram encoding, each character is mapped to a random bipolar vector, permuted by a fixed operator ρ to encode positional information, and bound with its neighbors to form n-gram representations. These are aggregated across a text into a single D-dimensional summary vector. The process is defined as:

v_w = \bigodot_{j=1}^n \rho^j(v_{S_j}), \qquad V = \sum_{w} c(w)\,v_w, \qquad \widehat{V} = \frac{V}{\|V\|_2}

where c(w) is the count of n-gram w in the document and ρ denotes a fixed cyclic permutation of the vector coordinates.
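These primitives are easy to verify numerically. The following NumPy sketch (dimensionality and seed are arbitrary illustrative choices) checks that random bipolar vectors are nearly orthogonal, that binding produces a vector dissimilar to its inputs, and that bundling preserves similarity to each summand:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality

# Random bipolar base vectors; independent draws are nearly orthogonal.
a = rng.choice([-1, 1], size=D)
b = rng.choice([-1, 1], size=D)
cos_ab = a @ b / D  # near 0 for independent vectors

# Binding (element-wise product): the result is dissimilar to both inputs.
bound = a * b
cos_bound_a = bound @ a / D  # also near 0

# A fixed cyclic permutation rho encodes position; rho^j is j rolls.
def rho(v, j=1):
    return np.roll(v, j)

# Bundling (addition): each summand remains detectable in the sum.
bundle = a + b
cos_bundle_a = bundle @ a / (np.linalg.norm(bundle) * np.linalg.norm(a))
# cos_bundle_a is close to 1/sqrt(2), i.e. well above chance
```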

2. Algorithmic Workflow and Pseudocode

A standard workflow for N-gram HD encoding (Alonso et al., 2020) consists of the following steps:

  1. Initialization: Assign each symbol S in the alphabet Σ a random bipolar base vector v_S ∈ {±1}^D.
  2. N-gram Extraction: Use a sliding window over the input text to enumerate all overlapping character n-grams.
  3. Binding: For each n-gram w, bind the permuted symbol hypervectors as above.
  4. Bundling: Accumulate the bound n-gram hypervectors into the sum V.
  5. Normalization: Obtain V̂ by ℓ₂ normalization.
  6. Classifier Input: The normalized HD vector is input to standard classifiers.

Pseudocode for the encoder (after Alonso et al., 2020), writing T for the input text to keep it distinct from the dimensionality D:

Initialize item memory: for each S ∈ Σ, v_S ← random bipolar vector in {±1}^D
V ← 0 ∈ ℝ^D
for each start position i of an n-gram in T:
    h ← ρ(v_{T[i]})
    for j = 2 to n:
        h ← h ⊙ ρ^j(v_{T[i + j - 1]})
    V ← V + h
normalize: V̂ ← V / ||V||₂
return V̂

The computational complexity is O(L · n · D), where L is the length of the input text.
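The workflow above translates directly into a few lines of NumPy. This is an illustrative sketch; the function name, seed, and default parameters are our own, not from the paper:

```python
import numpy as np

def encode_ngrams(text, n=3, D=1024, seed=0):
    """HD n-gram encoder sketch: bind permuted symbol vectors, bundle, normalize."""
    rng = np.random.default_rng(seed)
    # 1. Item memory: one random bipolar vector per distinct character.
    item = {s: rng.choice([-1, 1], size=D) for s in sorted(set(text))}
    V = np.zeros(D)
    # 2-4. Slide a window of width n; bind permuted symbol vectors, then bundle.
    for i in range(len(text) - n + 1):
        h = np.ones(D, dtype=int)
        for j in range(1, n + 1):
            # rho^j = cyclic shift by j positions encodes the symbol's slot.
            h = h * np.roll(item[text[i + j - 1]], j)
        V += h
    # 5. L2-normalize the bundled sum.
    return V / np.linalg.norm(V)

v = encode_ngrams("hyperdimensional computing", n=3, D=1024)
# v is a unit-length, D-dimensional summary of the text's character trigrams
```

Each window costs n vector operations of length D, which is where the complexity bound above comes from.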

3. Trade-offs: Dimensionality, Context Size, and Resource Efficiency

A key advantage of N-gram HD encoders is the decoupling of the output dimensionality D from the combinatorial explosion in n: a traditional n-gram model over an alphabet of size a requires a^n counters, whereas HD encoding always yields a D-dimensional vector. The choice of D governs fidelity and resource consumption:

  • Larger D: higher fidelity to the true n-gram histogram and higher classification accuracy, at increased memory and compute cost.
  • Smaller D: substantial memory and throughput savings, with potential accuracy loss if under-parameterized.

Empirical results indicate that F1 scores increase rapidly for small D (e.g., D = 32 already offers a marked improvement), saturating at a dataset-dependent D*: roughly D* ≈ 512 for small corpora and D* ≈ 4096 for large ones (Alonso et al., 2020). Resource improvements scale proportionally: train/test speedups and memory reductions of 10× to 100× over dense n-gram representations are common at minimal accuracy loss.
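A back-of-the-envelope comparison makes the decoupling concrete (the alphabet size here is an illustrative choice, not a figure from the paper):

```python
# Explicit n-gram histograms need a^n counters; an HD encoding stays at D.
a = 26                                       # e.g., lowercase Latin letters
counters = {n: a ** n for n in range(2, 5)}  # {2: 676, 3: 17576, 4: 456976}
D = 512                                      # fixed HD dimensionality
compression = counters[4] / D                # ~892x fewer dimensions at n = 4
```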

4. Integration into Transformer Architectures: N-gram Fusion Techniques

In neural text encoders such as ZEN 2.0 (Song et al., 2021), n-gram high-dimensional embeddings are constructed via unsupervised statistical extraction and Transformer-based contextualization:

  • N-gram Extraction: Spans of length 2–8 are selected by pointwise mutual information (PMI) and frequency thresholds; for Chinese, PMI ≥ 3 and c(w) ≥ 15 yield |V_ng| ≈ 261,000; for Arabic, PMI ≥ 10 and c(w) ≥ 20 yield |V_ng| ≈ 194,000.
  • HD Embedding & Encoding: A learnable lookup table E_ng ∈ ℝ^{|V_ng| × d} (with d = 768 or 1024), followed by a six-layer Transformer encoder for contextualization.

Fusion with character/token-level Transformer states is executed layer-wise. Let \mathbf{v}_i^{(l)} denote the token state at layer l, and let \{g_{i,1}, \dots, g_{i,K_i}\} be the overlapping n-grams covering token i, each contributing a contextualized vector \boldsymbol\mu_{i,k}^{(l)}. Fusion is additive:

\mathbf{v}_i^{(l)*} = \mathbf{v}_i^{(l)} + \sum_{k=1}^{K_i} p_{i,k}\,\boldsymbol\mu_{i,k}^{(l)}, \qquad p_{i,k} = \frac{c(g_{i,k})}{\sum_{k'} c(g_{i,k'})}

No gating or concatenation is used; n-gram signals augment token-level representations directly, with the n-gram encoder operating as a parallel Transformer.
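The fusion rule itself fits in a few lines of NumPy. The function and variable names below are illustrative stand-ins, not ZEN 2.0's actual implementation; only the count-weighted additive rule follows the paper:

```python
import numpy as np

def fuse_ngrams(token_state, ngram_states, counts):
    """Additive n-gram fusion: v_i* = v_i + sum_k p_{i,k} * mu_{i,k}.

    token_state  : (d,)   token hidden state v_i^(l) at one layer
    ngram_states : (K, d) contextualized vectors mu_{i,k} of the K n-grams
                          covering token i
    counts       : (K,)   corpus counts c(g_{i,k}) of those n-grams
    """
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                  # relative-count weights, no gating
    return token_state + p @ ngram_states

# Tiny worked example with d = 8 and two covering n-grams.
d = 8
v = np.zeros(d)
mus = np.stack([np.ones(d), 2 * np.ones(d)])
fused = fuse_ngrams(v, mus, counts=[1, 3])
# p = [0.25, 0.75], so fused = 0.25*1 + 0.75*2 = 1.75 in every coordinate
```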

5. Empirical Evaluation and Performance Metrics

HyperEmbed (Alonso et al., 2020) evaluated N-gram HD encoders on three small (Chatbot, AskUbuntu, WebApplication) and one large (20NewsGroups) corpus:

  • Baselines: Conventional character n-gram counts (~200,000 dimensions).
  • HD Embeddings: n = 2–4, D ranging over 2^5–2^{14}.
  • Classifiers: Ridge, k-nearest neighbors (KNN), MLP, Passive-Aggressive (PA), Random Forest (RF), Linear SVC (LSVC), SGD, Nearest Centroid (NC), Bernoulli Naive Bayes (BNB).

Key results:

  • For AskUbuntu (MLP, n = 3, D = 512): F1 = 0.91 vs. a baseline of 0.92, with 4.62× faster training, 3.84× faster testing, and a 6.18× memory reduction.
  • For 20NewsGroups (D = 2048, n = 2–3): most classifiers maintained ~90% of the baseline F1 with 50–200× speedups and a 100× memory reduction.

Linear classifiers and shallow MLPs generally outperformed local or tree-based models, which lost accuracy due to the distributed representation's smoothing effects.

ZEN 2.0 (Song et al., 2021) demonstrates consistent state-of-the-art improvements across a battery of Chinese and Arabic NLP tasks (e.g., MSR-CWS F1 = 98.66, CMRC 2018 F1 = 89.92), outperforming prior benchmarks typically by 0.1–2.0 absolute points.

6. Practical Guidelines and Adaptation Considerations

Recommendations for practitioners (Alonso et al., 2020, Song et al., 2021):

  • Select n = 2–4 for small corpora and n = 2–3 for large ones.
  • Sweep D from 2^5 to 2^{14}; select the minimal D achieving 95–98% of baseline accuracy.
  • Prefer linear and shallow neural classifiers for HD vectors.
  • For resource-constrained environments, binarize encodings and classifier weights.
  • ZEN 2.0 architecture adapts to multiple languages (Chinese, Arabic) and domains via threshold tuning and separate n-gram vocabularies, without structural changes.
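The dimensionality sweep in the second recommendation is easy to script. In this sketch, score_fn and the toy saturation curve are hypothetical stand-ins for an actual train-and-evaluate routine:

```python
import numpy as np

def pick_min_dimension(score_fn, baseline_f1, target=0.95):
    """Return the smallest D in 2^5..2^14 reaching `target` of the baseline F1.

    `score_fn(D)` is a user-supplied routine (hypothetical here) that trains a
    classifier on D-dimensional HD encodings and returns its F1 score.
    """
    for D in (2 ** k for k in range(5, 15)):
        if score_fn(D) >= target * baseline_f1:
            return D
    return None  # no D in the sweep met the target

# Toy stand-in: F1 saturates as D grows (synthetic curve, not real data).
toy_f1 = lambda D: 0.92 * (1 - np.exp(-D / 300))
D_star = pick_min_dimension(toy_f1, baseline_f1=0.92, target=0.95)
```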

7. Significance and Applications

N-gram HD encoders address scaling and efficiency challenges in embedding n-gram statistics for NLP tasks. By leveraging high-dimensionality and distributed encoding, they provide concise, accurate representations with dramatic resource savings. They integrate seamlessly into classic ML pipelines and Transformer-based neural architectures, enabling robust, end-to-end modeling of contextual spans. This framework offers practical trade-offs between memory, speed, and accuracy, and demonstrates superiority in multilingual, multi-domain applications, verified experimentally on several production-scale corpora (Alonso et al., 2020, Song et al., 2021).
