
Hashed N-Gram Feature Space

Updated 13 February 2026
  • Hashed N-Gram Feature Space is a high-dimensional representation that deterministically maps every overlapping n-gram from sequences into fixed-size vectors using hash functions.
  • It reduces the combinatorial explosion of n-grams through efficient hashing and aggregation methods, facilitating dense retrieval and robust classification with minimal information loss.
  • Empirical evaluations, such as NUMEN’s 93.90% Recall@100, demonstrate that tuning hash dimensions effectively balances collision rates and computational efficiency.

A hashed n-gram feature space is a high-dimensional representation in which all (possibly overlapping) n-grams extracted from a sequence (text, DNA, or other) are deterministically mapped into a fixed-size feature vector using a hash function. This mapping collapses the combinatorially large space of distinct n-grams into a manageable dimensionality, supports both dense and sparse usage, incurs only minor information loss when properly parameterized, and allows for efficient large-scale learning, retrieval, and data selection without explicit enumeration of the n-gram vocabulary.

1. Mathematical Construction of Hashed N-Gram Feature Spaces

Let $T$ denote a sequence (of characters, bytes, or tokens) over an alphabet $\Sigma$. The process for constructing a hashed n-gram feature space comprises:

  • N-Gram Extraction: For each $n$ in a prescribed set (e.g., $n\in\{3,4,5\}$), enumerate all substrings $g$ of length $n$ in $T$. Optionally, boundary markers preserve context at the edges of the sequence (as in NUMEN, which uses markers such as “$\sqcup$”) (Sharma, 21 Jan 2026).
  • Hashing: Each n-gram $g$ is mapped to an integer bucket $h(g)\in\{0,\dots,d-1\}$ using a deterministic function; e.g., $h(g)=\mathrm{CRC32}(g) \bmod d$ (character-level, NUMEN), or $h(g)=H(g) \bmod d$, with $H$ being FNV or CityHash (byteSteady) (Zhang et al., 2021), or any appropriate hash function.
  • Aggregation: For each bucket $j$, a feature count is computed:

$$v_j = \sum_{g \,:\, h(g) = j} w_{|g|}$$

where the sum runs over all n-gram occurrences $g$ in $T$ and $w_{|g|}$ weights n-grams by their length (e.g., longer n-grams receive larger weight) (Sharma, 21 Jan 2026).

  • Post-Processing: Optionally, counts are log-saturated to mimic BM25-style diminishing returns, and/or L2-normalized to obtain a unit vector suitable for maximum inner-product search (MIPS) or cosine similarity retrieval (Sharma, 21 Jan 2026). In classification contexts, the average of embedding lookups indexed by the bucket $h(g)$ is used (Zhang et al., 2021).
  • Dimensionality: The user can set the hash dimension $d$ arbitrarily large for collision mitigation and separability, trading off memory for accuracy.
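The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NUMEN or byteSteady implementation: the function names, the single-space boundary marker, the weight $w_n = n$, and the default dimension are all hypothetical choices made here for the example.

```python
import math
import zlib


def hashed_ngram_vector(text, dim=4096, orders=(3, 4, 5)):
    """Map text to an L2-normalized hashed character n-gram vector."""
    vec = [0.0] * dim
    padded = f" {text} "  # assumed boundary marker: one space on each side
    for n in orders:
        weight = float(n)  # assumed weighting: longer n-grams count more
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            # Deterministic bucket: CRC32 of the UTF-8 bytes, modulo dim.
            bucket = zlib.crc32(gram.encode("utf-8")) % dim
            vec[bucket] += weight
    # Log-saturate so repeated n-grams give diminishing returns.
    vec = [math.log1p(v) for v in vec]
    # L2-normalize so the inner product equals cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    if norm > 0.0:
        vec = [v / norm for v in vec]
    return vec


def cosine(u, v):
    """Inner product of two unit vectors."""
    return sum(a * b for a, b in zip(u, v))


a = hashed_ngram_vector("the quick brown fox")
b = hashed_ngram_vector("the quick brown dog")
c = hashed_ngram_vector("lazy dogs sleep all day")
print(cosine(a, b), cosine(a, c))
```

Texts that share many n-grams land in the same buckets and score a higher inner product; unrelated texts overlap only through hash collisions.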

2. Hash Function Choice, Independence, and Computational Considerations

The choice and analysis of hash functions for n-grams have profound implications:

  • Pairwise Independence Optimality: Recursive (rolling) hash families, regardless of how constructed, can be at most pairwise independent; full $k$-wise independence ($k\ge 3$) is impossible for sliding n-gram windows (0705.4676).
  • Irreducible vs. Cyclic Hashes: Hashes built on irreducible polynomials over $\mathrm{GF}(2)$ are formally pairwise independent but cost more per update; cyclic polynomial schemes, which correspond to bitwise rotations and XORs, are cheaper per update and, after dropping $n-1$ bits, are also pairwise independent. This yields practical throughputs exceeding 100 million n-grams/sec per thread (0705.4676).
  • Uniformity and Collisions: CRC32, FNV, and CityHash offer fast, hardware-accelerated, and empirically uniform distributions (Sharma, 21 Jan 2026, Zhang et al., 2021), with negligible bias and collision probability provided that the hash dimension $d$ is chosen commensurately with the expected number of n-grams per instance (see Section 3 below).
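To make the rotation-and-XOR idea concrete, here is a hedged sketch of a cyclic polynomial (Buzhash-style) rolling hash over bytes. The helper names, the 32-bit word size, and the seeded random lookup table are assumptions for illustration; the bit-dropping step needed for the pairwise-independence guarantee is omitted to keep the sketch short.

```python
import random


def rotl(x, r, bits=32):
    """Rotate the bits-wide integer x left by r positions."""
    r %= bits
    mask = (1 << bits) - 1
    return ((x << r) | (x >> (bits - r))) & mask


def make_table(bits=32, seed=0):
    """Random per-byte values (hypothetical seeded table for the example)."""
    rng = random.Random(seed)
    return [rng.getrandbits(bits) for _ in range(256)]


def direct_hash(gram, table, bits=32):
    """Hash one n-gram from scratch: rotate, then XOR in each byte's value."""
    h = 0
    for byte in gram:
        h = rotl(h, 1, bits) ^ table[byte]
    return h


def rolling_hashes(data, n, table, bits=32):
    """Yield the hash of every length-n window using a constant-work update."""
    if len(data) < n:
        return
    h = direct_hash(data[:n], table, bits)
    yield h
    for i in range(n, len(data)):
        outgoing, incoming = data[i - n], data[i]
        # Rotate the old hash, cancel the outgoing byte (now rotated n
        # times in total), and XOR in the incoming byte.
        h = rotl(h, 1, bits) ^ rotl(table[outgoing], n, bits) ^ table[incoming]
        yield h


table = make_table()
data = b"hashed n-gram feature space"
print(list(rolling_hashes(data, 5, table))[:3])
```

Each update touches only the outgoing and incoming bytes, so the per-window cost does not grow with the window length.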

3. Collision Probability and Dimensionality Selection

Collisions—two distinct n-grams mapping to the same bucket—are a core property of the hashing trick:

  • Birthday Paradox Analysis: Given $k$ n-grams and hash space size $d$, the collision probability is approximately

$$P_{\text{collision}} \approx 1 - \exp\!\left(-\frac{k^2}{2d}\right)$$

For example, with nn8 and nn9, n{3,4,5}n\in\{3,4,5\}0 (Sharma, 21 Jan 2026).

  • Empirical Plateaus: Retrieval or classification quality plateaus (recall ceases to increase) when collisions become prevalent. In NUMEN, increasing the hash dimension $d$ from 512 to 32,768 yields monotonic recall improvements, with saturation occurring as collisions surpass a few percent (Sharma, 21 Jan 2026); byteSteady observes similar diminishing returns at larger hash table sizes for byte-level tasks (Zhang et al., 2021).
  • Theoretical Guidance: The tradeoff between n{3,4,5}n\in\{3,4,5\}5, n{3,4,5}n\in\{3,4,5\}6 (hash dimension), n{3,4,5}n\in\{3,4,5\}7 (relative distortion tolerance), and n{3,4,5}n\in\{3,4,5\}8 (failure probability) is now fully characterized: for n{3,4,5}n\in\{3,4,5\}9,
    • gg0 suffices for all gg1 (Freksen et al., 2018);
    • For less "peaky" vectors, smaller gg2 is possible, with the hashing trick achieving Johnson-Lindenstrauss-type bounds up to mild constant and log-log factors (Freksen et al., 2018).
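The birthday-paradox estimate in this section is easy to evaluate numerically. The sketch below compares the standard approximation $1-\exp(-k^2/(2d))$ with the exact probability that at least two of $k$ uniformly hashed n-grams share a bucket. The function names and the choice of $k=100$ are hypothetical; the dimensions are the ones that appear in the NUMEN recall table.

```python
import math


def approx_collision_probability(num_ngrams, dim):
    """Birthday approximation: 1 - exp(-k^2 / (2d))."""
    return 1.0 - math.exp(-(num_ngrams ** 2) / (2.0 * dim))


def exact_collision_probability(num_ngrams, dim):
    """Exact P(at least one shared bucket) under uniform, independent hashing."""
    p_no_collision = 1.0
    for i in range(num_ngrams):
        p_no_collision *= (dim - i) / dim
    return 1.0 - p_no_collision


k = 100  # hypothetical number of distinct n-grams in one document
for d in (512, 1024, 2048, 4096, 8192, 16384, 32768):
    approx = approx_collision_probability(k, d)
    exact = exact_collision_probability(k, d)
    print(f"d={d:6d}  approx={approx:.4f}  exact={exact:.4f}")
```

Doubling $d$ roughly halves the exponent, which is why the collision probability falls quickly once $d$ is well above $k^2$.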

4. Applications: Retrieval, Classification, and Data Selection

Hashed n-gram feature spaces underpin a variety of large-scale machine learning and information retrieval systems:

  • Dense Retrieval (NUMEN): Documents and queries are encoded into L2-normalized hashed n-gram vectors. Retrieval is performed via maximum inner product search, using standard tools (e.g., FAISS), and at sufficient hash dimension $d$ matches or exceeds classic BM25 recall, achieving 93.90% Recall@100 at $d=32{,}768$ on the LIMIT benchmark (Sharma, 21 Jan 2026). The principal advantage is elimination of embedding bottlenecks: the geometry scales as $d$ increases, directly targeting the "sign-rank" bottleneck of low-dimensional learned representational spaces (Sharma, 21 Jan 2026).
  • Classification (byteSteady): Each byte-level n-gram is mapped via a hash function to a compact embedding table, and the per-instance representation is the average of selected embeddings. byteSteady achieves state-of-the-art or near-SOTA text and gene classification, robust to n-gram collisions and with modest model footprint (e.g., 256 MB at gg6, gg7, gg8), while allowing optional compression-based speedups (Zhang et al., 2021).
  • Data Selection (DSIR): Hashed n-gram histograms are used as tractable proxies for true n-gram distributions in multi-billion document corpora, enabling efficient calculation of document importance weights and associated data curation for LLM training. KL-reduction evaluated in the hashed n-gram space correlates gg9 with downstream accuracy (Xie et al., 2023).
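The data-selection use case can be illustrated with a toy importance-weight computation over hashed n-gram histograms. This is a simplified sketch in the spirit of DSIR, not its actual code: the whitespace tokenization, unigram-plus-bigram features, add-one smoothing, bucket count, and all function names are assumptions made for the example.

```python
import math
import zlib
from collections import Counter


def hashed_counts(text, dim, orders=(1, 2)):
    """Count hashed word n-grams (unigrams and bigrams) per bucket."""
    tokens = text.lower().split()
    counts = Counter()
    for n in orders:
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            counts[zlib.crc32(gram.encode("utf-8")) % dim] += 1
    return counts


def fit_bucket_distribution(docs, dim, smoothing=1.0):
    """Fit a smoothed categorical distribution over hash buckets."""
    totals = [smoothing] * dim
    for doc in docs:
        for bucket, count in hashed_counts(doc, dim).items():
            totals[bucket] += count
    z = sum(totals)
    return [t / z for t in totals]


def log_importance_weight(doc, p_target, p_raw):
    """log p_target(doc) - log p_raw(doc) under a bag-of-buckets model."""
    dim = len(p_target)
    return sum(
        count * (math.log(p_target[bucket]) - math.log(p_raw[bucket]))
        for bucket, count in hashed_counts(doc, dim).items()
    )


DIM = 4096
target_docs = ["dna sequence alignment", "gene sequence classification"]
raw_docs = target_docs + ["stock market prices", "weather forecast today"]
p_target = fit_bucket_distribution(target_docs, DIM)
p_raw = fit_bucket_distribution(raw_docs, DIM)
print(log_importance_weight("gene sequence alignment", p_target, p_raw))
print(log_importance_weight("stock market forecast", p_target, p_raw))
```

Documents whose hashed n-grams are over-represented in the target corpus receive larger weights, so resampling by weight pulls the raw pool toward the target distribution without ever enumerating the n-gram vocabulary.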

Empirical Results: Recall@K vs. Hash Dimension (NUMEN, LIMIT Benchmark)

Dimension   Recall@2   Recall@10   Recall@100
512          2.70%      7.15%       21.30%
1024        13.20%     23.85%       45.10%
2048        33.05%     49.45%       68.80%
4096        56.50%     70.10%       83.20%
8192        70.65%     81.60%       89.85%
16384       79.45%     86.65%       93.05%
32768       81.45%     88.00%       93.90%

For comparison, BM25’s Recall@100 is 93.6% (Sharma, 21 Jan 2026).

5. Algorithmic and Practical Design Patterns

  • N-gram Order Selection: Classification and retrieval benefit from multi-scale n-gram inclusion; e.g., byteSteady reports best results with nn0 (text), nn1 (genes) (Zhang et al., 2021). NUMEN fixes nn2 (Sharma, 21 Jan 2026).
  • Weighting and Saturation: Assigning larger weights to longer n-grams and log-saturating repeated grams improve robustness to redundancy and match traditional IR heuristics (e.g., BM25) (Sharma, 21 Jan 2026).
  • Learning vs. Determinism: NUMEN is fully training-free; byteSteady learns embedding vectors but uses fixed hashing; DSIR operates entirely with hashed count vectors and generative mixture models (Sharma, 21 Jan 2026, Zhang et al., 2021, Xie et al., 2023).
  • Compression Extensions: In byte-level settings, n-gram extraction can be performed on Huffman-compressed inputs. Lossless compression reduces runtime per document nearly linearly, with minimal impact on classification error for light compression (e.g., 0.2–0.5% increase), but more aggressive compression creates an explicit accuracy-speed trade-off frontier (Zhang et al., 2021).
  • Streaming and Scalability: Hashed n-gram feature maps can be constructed in nn3 time per document (with nn4 its length), require modest per-document memory (scaling with nn5 or nn6), and support streaming aggregation or distributed computation (Xie et al., 2023).
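The streaming property can be sketched as a single pass over chunks of a byte stream, carrying over only the last few bytes so that n-grams spanning a chunk boundary are counted exactly once. The function name, the CRC32 hash, and the default parameters are assumptions for illustration.

```python
import zlib
from collections import Counter


def stream_hashed_counts(chunks, dim=4096, orders=(3, 4, 5)):
    """Accumulate hashed byte n-gram counts over an iterable of byte chunks."""
    counts = Counter()
    max_n = max(orders)
    tail = b""  # last (max_n - 1) bytes seen so far
    for chunk in chunks:
        buf = tail + chunk
        t = len(tail)
        for n in orders:
            # Windows that end inside the carried-over tail were already
            # counted, so start at the first window reaching the new chunk.
            first = max(0, t - n + 1)
            for i in range(first, len(buf) - n + 1):
                counts[zlib.crc32(buf[i:i + n]) % dim] += 1
        tail = buf[-(max_n - 1):] if max_n > 1 else b""
    return counts


text = b"hashed n-gram feature spaces"
whole = stream_hashed_counts([text])
chunked = stream_hashed_counts([text[:5], text[5:6], b"", text[6:17], text[17:]])
print(whole == chunked)
```

Because the result is a plain count map, partial results from different workers can be merged by adding the counters, which is what makes distributed aggregation straightforward.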

6. Theoretical Guarantees and Limitations

  • Norm Preservation: When nn7 is small (typical for n-gram histograms), hashed feature spaces behave like sparse Johnson-Lindenstrauss transforms. The central bound is: for nn8,

nn9

for TT0, otherwise as a precise function of TT1 (Freksen et al., 2018).

  • Hash Family Result: Recursive n-gram hashing cannot exceed pairwise independence due to window overlap constraints; no rolling hash attains full 3-wise independence (0705.4676).
  • Collision Robustness: Empirically, collisions at the few-percent level have little effect; both learning-based (byteSteady) and unsupervised (DSIR) pipelines are robust by virtue of averaging embeddings or focusing on histogram statistics (Zhang et al., 2021, Xie et al., 2023).
  • Trade-offs: There is a direct, quantifiable trade-off among feature space dimensionality, collision rate, empirical accuracy (retrieval/classification), and system memory requirements (Sharma, 21 Jan 2026, Zhang et al., 2021, Freksen et al., 2018).
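The norm-preservation behaviour can be checked empirically with a small signed feature-hashing sketch. Note the assumptions: this uses the signed variant of the hashing trick (a random sign per feature, the setting usually analyzed in Johnson-Lindenstrauss-style results), a BLAKE2b digest as a stand-in for a random hash, and hypothetical function and key names. It illustrates the idea and does not reproduce any cited construction.

```python
import hashlib


def signed_hash_project(sparse_vec, dim):
    """Project a {feature: value} map into dim buckets with random signs."""
    out = [0.0] * dim
    for key, value in sparse_vec.items():
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        bucket = h % dim
        # The top bit of the digest picks the sign.
        sign = 1.0 if (h >> 63) & 1 else -1.0
        out[bucket] += sign * value
    return out


def squared_norm(values):
    return sum(v * v for v in values)


# A flat ("non-peaky") vector: 200 features of equal weight.
x = {f"gram{i}": 1.0 for i in range(200)}
for d in (256, 1024, 8192):
    ratio = squared_norm(signed_hash_project(x, d)) / squared_norm(x.values())
    print(f"d={d:5d}  ||Ax||^2 / ||x||^2 = {ratio:.3f}")
```

Only colliding features perturb the squared norm, and with random signs their cross terms tend to cancel, so the ratio stays close to 1 once $d$ is large relative to the number of nonzero features.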

7. Extensions, Impact, and Open Challenges

Hashed n-gram feature spaces enable scalable learning and retrieval over unbounded vocabularies, support flexible dimensionality allocation per task, and admit practical hardware-accelerated implementations. They are used in dense retrieval, fast multiclass text and DNA classification, and billion-scale document resampling for LLM pretraining (Sharma, 21 Jan 2026, Zhang et al., 2021, Xie et al., 2023). Open directions include architectural integration with transformer LLMs (specifically for retrieval augmentation), dynamic or adaptive dimensionality for low-resource tasks, and rigorous analysis of the expressivity–collision trade-off in broader mixture-of-experts and continual learning settings.
