Hashed N-Gram Feature Space
- Hashed N-Gram Feature Space is a high-dimensional representation that deterministically maps every overlapping n-gram from sequences into fixed-size vectors using hash functions.
- It reduces the combinatorial explosion of n-grams through efficient hashing and aggregation methods, facilitating dense retrieval and robust classification with minimal information loss.
- Empirical evaluations, such as NUMEN’s 93.90% Recall@100, demonstrate that tuning hash dimensions effectively balances collision rates and computational efficiency.
A hashed n-gram feature space is a high-dimensional representation in which all (possibly overlapping) n-grams extracted from a sequence (text, DNA, or other) are deterministically mapped into a fixed-size feature vector using a hash function. This mapping collapses the combinatorially large space of distinct n-grams into a manageable dimensionality, supports both dense and sparse usage, incurs only minor information loss when properly parameterized, and allows for efficient large-scale learning, retrieval, and data selection without explicit enumeration of the n-gram vocabulary.
1. Mathematical Construction of Hashed N-Gram Feature Spaces
Let s denote a sequence (of characters, bytes, or tokens) over an alphabet Σ, and let D be the target feature dimension. The process for constructing a hashed n-gram feature space comprises:
- N-Gram Extraction: For each order n in a prescribed set (e.g., several consecutive orders), enumerate all substrings of length n in s. Optionally, boundary markers prepended and appended to the sequence preserve positional context (as in NUMEN) (Sharma, 21 Jan 2026).
- Hashing: Each n-gram g is mapped to an integer bucket b = h(g) mod D using a deterministic hash function h; NUMEN hashes character-level n-grams (Sharma, 21 Jan 2026), byteSteady uses FNV or CityHash over byte n-grams (Zhang et al., 2021), and any well-distributed hash function is suitable.
- Aggregation: For each bucket b, a feature count is computed as x_b = Σ_{g : h(g) mod D = b} w(|g|) · count(g), where w weights n-grams by order (e.g., longer n-grams receive larger weight) (Sharma, 21 Jan 2026).
- Post-Processing: Optionally, counts are log-saturated to mimic BM25-style diminishing returns, and/or L2-normalized to obtain a unit vector suitable for maximum inner-product search (MIPS) or cosine-similarity retrieval (Sharma, 21 Jan 2026). In classification contexts, the per-instance representation is instead the average of embedding lookups indexed by the hashed bucket values (Zhang et al., 2021).
- Dimensionality: The user sets the hash dimension D as large as needed for collision mitigation and separability, trading memory for accuracy.
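The steps above can be sketched end-to-end. This is an illustrative sketch only: the n-gram orders, the length-proportional weights, the space-character boundary markers, and the MD5-based bucketing are assumptions, not NUMEN's exact parameterization.

```python
import hashlib
import math

def hashed_ngram_vector(text, n_range=(2, 3, 4, 5), dim=4096,
                        weight=lambda n: float(n), log_saturate=True):
    """Map a string to an L2-normalized hashed n-gram feature vector."""
    counts = [0.0] * dim
    padded = " " + text + " "  # crude stand-in for boundary markers (assumption)
    for n in n_range:
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            # Deterministic hash of the n-gram -> bucket index in [0, dim)
            bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dim
            counts[bucket] += weight(n)          # longer n-grams weigh more
    if log_saturate:                             # BM25-style diminishing returns
        counts = [math.log1p(c) for c in counts]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]            # unit vector for cosine / MIPS
```

Because the output is unit-normalized, the dot product of two such vectors is their cosine similarity, so the same encoding serves both dense retrieval and nearest-neighbor classification.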
2. Hash Function Choice, Independence, and Computational Considerations
The choice and analysis of hash functions for n-grams have profound implications:
- Pairwise Independence Optimality: Recursive (rolling) hash families, regardless of how constructed, can be at most pairwise independent; full k-wise independence (k ≥ 3) is impossible for sliding n-gram windows (0705.4676).
- Irreducible vs. Cyclic Hashes: Hashes built on irreducible polynomials over GF(2) are formally pairwise independent but are comparatively expensive to update; cyclic polynomial schemes, which correspond to bitwise rotations and XORs, cost only a few word operations per window update and, after dropping bits, are also pairwise independent—this yields practical throughputs exceeding 100 million n-grams/sec per thread (0705.4676).
- Uniformity and Collisions: CRC32, FNV, and CityHash offer fast, hardware-accelerated, and empirically uniform distributions (Sharma, 21 Jan 2026, Zhang et al., 2021), with negligible bias and collision probability provided that D is chosen commensurately with the expected number of n-grams per instance (see Section 3 below).
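A minimal cyclic-polynomial (rotate-and-XOR) rolling hash makes the constant-cost window update concrete; the 32-bit word width and the seeded per-byte table are arbitrary choices of this sketch:

```python
import random

W = 32
MASK = (1 << W) - 1
# Per-byte random table; the fixed seed is an arbitrary choice for determinism.
TABLE = [random.Random(42 + b).getrandbits(W) for b in range(256)]

def rotl(x, r):
    """Rotate a W-bit word left by r positions."""
    r %= W
    return ((x << r) | (x >> (W - r))) & MASK

def cyclic_hashes(data: bytes, n: int):
    """Hash every length-n window of `data`: one rotate and two XORs per slide."""
    out = []
    h = 0
    for i, b in enumerate(data):
        h = rotl(h, 1) ^ TABLE[b]             # fold in the entering byte
        if i >= n:
            h ^= rotl(TABLE[data[i - n]], n)  # cancel the byte leaving the window
        if i >= n - 1:
            out.append(h)
    return out
```

Because the update cost is independent of n, every n-gram of a long document is hashed in a single linear pass, which is what enables the throughputs cited above.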
3. Collision Probability and Dimensionality Selection
Collisions—two distinct n-grams mapping to the same bucket—are a core property of the hashing trick:
- Birthday Paradox Analysis: Given m distinct n-grams hashed uniformly into a space of size D, the probability of at least one collision is approximately P ≈ 1 − exp(−m(m−1)/(2D)); since the exponent scales as m²/D, keeping P fixed requires D to grow quadratically in m (Sharma, 21 Jan 2026).
- Empirical Plateaus: Retrieval or classification quality plateaus—recall ceases to increase—once collisions become prevalent. In NUMEN, increasing D from 512 to 32768 yields monotonic recall improvements, with saturation setting in as the collision rate reaches a few percent (Sharma, 21 Jan 2026); byteSteady observes similar diminishing returns beyond moderate hash sizes on byte-level tasks (Zhang et al., 2021).
- Theoretical Guidance: The tradeoff among the number of features m, the hash dimension D, the relative distortion tolerance ε, and the failure probability δ is now fully characterized: for D = Θ(ε⁻² log(1/δ)),
- norm preservation holds for all vectors whose ℓ∞/ℓ₂ ratio is below an explicit threshold (Freksen et al., 2018);
- For less "peaky" vectors (smaller ℓ∞/ℓ₂ ratio), smaller D is possible, with the hashing trick achieving Johnson-Lindenstrauss-type bounds up to mild constant and log-log factors (Freksen et al., 2018).
4. Applications: Retrieval, Classification, and Data Selection
Hashed n-gram feature spaces underpin a variety of large-scale machine learning and information retrieval systems:
- Dense Retrieval (NUMEN): Documents and queries are encoded into L2-normalized hashed n-gram vectors. Retrieval is performed via maximum inner product search using standard tools (e.g., FAISS) and, at sufficient dimensionality, matches or exceeds classic BM25 recall, achieving 93.90% Recall@100 at D = 32768 on the LIMIT benchmark (Sharma, 21 Jan 2026). The principal advantage is elimination of embedding bottlenecks: the geometry scales as D increases, directly targeting the "sign-rank" bottleneck of low-dimensional learned representation spaces (Sharma, 21 Jan 2026).
- Classification (byteSteady): Each byte-level n-gram is mapped via a hash function to an index into a compact embedding table, and the per-instance representation is the average of the selected embeddings. byteSteady achieves state-of-the-art or near-SOTA text and gene classification, robust to n-gram collisions and with a modest model footprint (e.g., 256 MB in its reported configuration), while allowing optional compression-based speedups (Zhang et al., 2021).
- Data Selection (DSIR): Hashed n-gram histograms are used as tractable proxies for true n-gram distributions in multi-billion document corpora, enabling efficient calculation of document importance weights and associated data curation for LLM training. KL-reduction evaluated in the hashed n-gram space correlates with downstream accuracy (Xie et al., 2023).
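The NUMEN-style retrieval loop can be sketched with a toy encoder and brute-force maximum inner product search standing in for FAISS; the trigram order, MD5 bucketing, and dimension here are assumptions of the sketch, not NUMEN's configuration:

```python
import hashlib
import numpy as np

def encode(text: str, dim: int = 1024, n: int = 3) -> np.ndarray:
    """Toy character-trigram hashed encoder, log-saturated and L2-normalized."""
    v = np.zeros(dim)
    for i in range(len(text) - n + 1):
        v[int(hashlib.md5(text[i:i + n].encode()).hexdigest(), 16) % dim] += 1.0
    v = np.log1p(v)                      # BM25-style saturation
    nrm = np.linalg.norm(v)
    return v / nrm if nrm else v

def top_k(query: str, docs: list, k: int = 2) -> list:
    """Brute-force maximum inner product search (stand-in for a FAISS index)."""
    M = np.stack([encode(d) for d in docs])   # document matrix, one row each
    scores = M @ encode(query)                # inner product = cosine (unit norms)
    return [docs[i] for i in np.argsort(-scores)[:k]]
```

In production the document matrix would be built once and served from an inner-product index; the geometry is identical, only the search structure changes.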
Empirical Results: Recall@K vs. Hash Dimension (NUMEN, LIMIT Benchmark)
| Dimension | Recall@2 | Recall@10 | Recall@100 |
|---|---|---|---|
| 512 | 2.70% | 7.15% | 21.30% |
| 1024 | 13.20% | 23.85% | 45.10% |
| 2048 | 33.05% | 49.45% | 68.80% |
| 4096 | 56.50% | 70.10% | 83.20% |
| 8192 | 70.65% | 81.60% | 89.85% |
| 16384 | 79.45% | 86.65% | 93.05% |
| 32768 | 81.45% | 88.00% | 93.90% |
For comparison, BM25’s Recall@100 is 93.6% (Sharma, 21 Jan 2026).
5. Algorithmic and Practical Design Patterns
- N-gram Order Selection: Classification and retrieval benefit from multi-scale n-gram inclusion; byteSteady reports its best results with task-dependent n-gram orders for text versus gene sequences (Zhang et al., 2021), while NUMEN fixes a single character-level range of orders (Sharma, 21 Jan 2026).
- Weighting and Saturation: Assigning larger weights to longer n-grams and log-saturating repeated grams improve robustness to redundancy and match traditional IR heuristics (e.g., BM25) (Sharma, 21 Jan 2026).
- Learning vs. Determinism: NUMEN is fully training-free; byteSteady learns embedding vectors but uses fixed hashing; DSIR operates entirely with hashed count vectors and generative mixture models (Sharma, 21 Jan 2026, Zhang et al., 2021, Xie et al., 2023).
- Compression Extensions: In byte-level settings, n-gram extraction can be performed on Huffman-compressed inputs. Lossless compression reduces runtime per document nearly linearly, with minimal impact on classification error for light compression (e.g., 0.2–0.5% increase), but more aggressive compression creates an explicit accuracy-speed trade-off frontier (Zhang et al., 2021).
- Streaming and Scalability: Hashed n-gram feature maps can be constructed in O(L) time per document (with L its length), require modest per-document memory (scaling with D or the number of distinct n-grams, whichever is smaller), and support streaming aggregation or distributed computation (Xie et al., 2023).
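The streaming and distributed patterns follow from hashed histograms being additive. A minimal sketch, using stdlib CRC32 (one of the hash functions named above) over word bigrams, with the dimension an arbitrary choice:

```python
import zlib
from collections import Counter

def hashed_histogram(tokens: list, dim: int = 1 << 16, n: int = 2) -> Counter:
    """Stream word n-grams into a hashed count histogram in one linear pass."""
    h = Counter()
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        h[zlib.crc32(gram.encode()) % dim] += 1
    return h

def merge(hists: list) -> Counter:
    """Distributed/streaming aggregation: hashed histograms add component-wise."""
    total = Counter()
    for h in hists:
        total.update(h)
    return total
```

Shards of a corpus can therefore be histogrammed independently and merged with a single reduction, which is what makes hashed n-gram statistics tractable at multi-billion-document scale.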
6. Theoretical Guarantees and Limitations
- Norm Preservation: When the ℓ∞/ℓ₂ ratio of the feature vector is small (typical for n-gram histograms), hashed feature spaces behave like sparse Johnson-Lindenstrauss transforms: for D = Ω(ε⁻² log(1/δ)), the squared ℓ₂ norm is preserved to within a factor 1 ± ε with probability at least 1 − δ whenever the ℓ∞/ℓ₂ ratio lies below a threshold given as a precise function of D, ε, and δ (Freksen et al., 2018).
- Hash Family Result: Recursive n-gram hashing cannot exceed pairwise independence due to window overlap constraints; no rolling hash attains full 3-wise independence (0705.4676).
- Collision Robustness: Empirically, collisions at the few-percent level have little effect; both learning-based (byteSteady) and unsupervised (DSIR) pipelines are robust by virtue of averaging embeddings or focusing on histogram statistics (Zhang et al., 2021, Xie et al., 2023).
- Trade-offs: There is a direct, quantifiable trade-off among feature space dimensionality, collision rate, empirical accuracy (retrieval/classification), and system memory requirements (Sharma, 21 Jan 2026, Zhang et al., 2021, Freksen et al., 2018).
7. Extensions, Impact, and Open Challenges
Hashed n-gram feature spaces enable scalable learning and retrieval over unlimited vocabularies, support flexible dimensionality allocation per task, and admit practical hardware-accelerated implementations. They are now deployed across dense retrieval, fast multiclass text and DNA classification, and billions-scale document resampling for LLM pretraining (Sharma, 21 Jan 2026, Zhang et al., 2021, Xie et al., 2023). Open directions include architectural integration with transformer LLMs (specifically for retrieval augmentation), dynamic or adaptive dimensionality for low-resource tasks, and rigorous analysis of the expressivity–collision trade-off in broader mixture-of-expert and continual learning settings.