Infini-gram Mini: Exact Scalable n-Gram Search
- Infini-gram mini is a scalable, exact n-gram search system that efficiently indexes and retrieves patterns from multi-terabyte to petabyte corpora.
- It leverages FM-index, wavelet trees, and sampled suffix arrays to achieve high compression ratios and rapid query responses on large datasets.
- Its distributed, parallel architecture supports large-scale tasks including data auditing, benchmark contamination detection, and linguistic analysis.
Infini-gram mini is a scalable, exact n-gram search system engineered for efficient operation over corpus scales reaching tens of terabytes to petabytes. Leveraging advanced compressed index structures, Infini-gram mini enables fast and memory-efficient search and document retrieval on raw web-scale text, supporting critical tasks such as large-scale data auditing, benchmark contamination detection, and linguistic analysis. Its design is rooted in both suffix array-based and FM-index-based methodologies, and it can be considered a high-compression, distributed evolution of prior “Infini-gram” approaches to n-gram language modeling and large-corpus pattern search (Xu et al., 13 Jun 2025, Liu et al., 2024).
1. Core Principles and Data Structures
Infini-gram mini is built on the FM-index, a succinct data structure that integrates text indexing and compression. For a corpus string T of length n over alphabet Σ:
- Sampled Suffix Array (SA) and Inverse Suffix Array (ISA): Rather than storing the entire suffix array (lexicographically sorted suffix positions) at n log n bits, only every s-th (SA) and s′-th (ISA) element is kept. Missing entries are recovered as needed via LF-mapping.
- Burrows–Wheeler Transform (BWT): The transformed string L is defined by L[i] = T[SA[i] − 1] if SA[i] > 0, and the unique end-of-text symbol otherwise.
- Wavelet Tree: A Huffman-shaped wavelet tree over L answers rank queries in time proportional to H_0 on average, where H_0 (zeroth-order empirical entropy) is typically ≈2.1 bits/byte for natural text.
The LF (“last-to-first”) mapping supports backward traversal and recovery of unsampled entries: LF(i) = C[L[i]] + rank_{L[i]}(L, i), where C[c] counts the symbols in T lexicographically smaller than c.
Empirical index size is a fraction of the raw corpus size, with the theoretical lower bound approaching nH_k(T) as the context order k grows.
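The BWT and LF-mapping above can be illustrated with a toy sketch (not the system's C++ implementation): build the BWT of a small text via its suffix array, then invert it by walking the LF-mapping. The text is assumed to end with a unique sentinel "$" smaller than all other symbols.

```python
def suffix_array(t):
    # O(n^2 log n) toy construction; the real system uses parallel induced sorting.
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt_from_sa(t, sa):
    # L[i] = t[sa[i] - 1]; when sa[i] == 0 this wraps to the sentinel t[-1].
    return "".join(t[i - 1] for i in sa)

def inverse_bwt(L):
    n = len(L)
    # C[c]: number of symbols in L strictly smaller than c.
    C, srt = {}, sorted(L)
    for i, c in enumerate(srt):
        C.setdefault(c, i)
    # rank[i]: occurrences of L[i] within L[0..i-1].
    seen, rank = {}, []
    for c in L:
        rank.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    # LF(i) = C[L[i]] + rank(L[i], i) maps row i to the row of the preceding symbol.
    LF = [C[L[i]] + rank[i] for i in range(n)]
    # Walk LF from row 0 (the rotation beginning with the sentinel),
    # emitting the text right to left.
    out, r = [], 0
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    rev = "".join(reversed(out))  # "$" followed by the text minus its sentinel
    return rev[1:] + rev[0]
```

For example, `bwt_from_sa("banana$", suffix_array("banana$"))` yields `"annb$aa"`, and `inverse_bwt("annb$aa")` recovers `"banana$"`.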
2. System Pipeline, Architecture, and Parallelism
Infini-gram mini operates on clusters of CPU nodes (e.g., 2 TiB RAM, high-performance SSD per node):
- Text Preprocessing:
  - Raw UTF-8 documents are concatenated using a unique delimiter (byte \xff).
  - A text-offset file records document boundaries.
- Sharding:
- The corpus is split into independent 700 GB shards.
- Each node indexes a shard in parallel using all vCPUs.
- Index Construction (per shard):
- Full SA and BWT are built via parallel induced sorting.
- Wavelet tree construction is parallelized via bitvector operations.
- SA and ISA are sampled by parallel scanning.
Memory use during indexing is far lower than in prior SDSL implementations: for an 8.7 GB shard, peak RAM drops from 75 GB (SDSL) to 23.7 GB, and walltime from 5,847 s to 324 s (≈18× faster).
Indexing 46 TB of web text completes in 50 days on a single 128-core node, or 19 hours across 75 nodes. Index files are memory-mapped at query time, keeping resident RAM usage low.
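The preprocessing step can be sketched as follows, assuming a simple in-memory list of documents (the real pipeline streams raw UTF-8 files on disk). The delimiter byte 0xFF never occurs in valid UTF-8, so it unambiguously marks document boundaries:

```python
def preprocess(docs):
    """Concatenate documents with the 0xFF delimiter byte and record each
    document's start offset (the contents of the text-offset file)."""
    DELIM = b"\xff"
    offsets, chunks, pos = [], [], 0
    for doc in docs:
        data = doc.encode("utf-8")
        offsets.append(pos)            # text-offset entry: document start
        chunks.append(data + DELIM)
        pos += len(data) + 1           # +1 for the delimiter byte
    return b"".join(chunks), offsets
```

For instance, `preprocess(["hello world", "foo bar"])` records start offsets `[0, 12]`, since the first document occupies bytes 0–10 and its delimiter byte 11.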
3. Exact n-Gram Query Algorithms
Infini-gram mini supports two central operations:
- Counting Occurrences: Implements right-to-left FM-index backward search:
```
(sp, ep) ← (0, n)
for i from |Q|−1 downto 0:
    c ← Q[i]
    sp ← C[c] + rank(c, sp)
    ep ← C[c] + rank(c, ep)
    if sp ≥ ep: return 0
return ep − sp
```
Query time is proportional to |Q| per shard, with shards searched in parallel. Under typical GCP SSDs, short count queries return within seconds; longer patterns take 8–25 s on the largest (tens-of-TB) corpora.
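The backward-search loop above can be made runnable in miniature, with a naive linear-scan rank standing in for the wavelet tree (which answers rank in near-constant time in the real system):

```python
def build_fm_index(text):
    """Toy FM-index: BWT plus the C array, via a naive suffix array."""
    text += "\x00"                       # unique end-of-text symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c]: count of symbols in the text strictly smaller than c.
    C = {}
    for i, c in enumerate(sorted(bwt)):
        C.setdefault(c, i)
    return bwt, C

def count_occurrences(bwt, C, query):
    """Exact occurrence count of `query` via FM-index backward search."""
    rank = lambda c, i: bwt[:i].count(c)  # wavelet tree in the real system
    sp, ep = 0, len(bwt)
    for c in reversed(query):
        if c not in C:
            return 0
        sp = C[c] + rank(c, sp)
        ep = C[c] + rank(c, ep)
        if sp >= ep:
            return 0
    return ep - sp
```

For example, on the text `"abracadabra"`, `count_occurrences(bwt, C, "abra")` returns 2 and `count_occurrences(bwt, C, "a")` returns 5.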
- Document Retrieval: For each matching SA index in [sp, ep):
- Locate the byte offset via LF-mapping (at most s steps, where s is the SA sampling rate).
- Binary-search the text-offset file for the document ID.
- Reconstruct the document/snippet by climbing with LF and the sampled ISA. Parallelism spans query entries and within-document threads.
Resource demands at query time are minimal; all data is streamed from SSD, and random-access is mitigated via system-level I/O optimizations.
4. Practical Configuration, Mini Variants, and Trade-offs
A “mini” Infini-gram system applies additional engineering constraints:
- Index Pruning: Discard rare suffixes (e.g., those whose count falls below a threshold).
- Pointer Width Reduction: Use 32-bit pointers when the shard fits (e.g., 4 bytes suffice for a 2 GB corpus).
- Limiting Context Length: Impose a maximum n-gram length; beyond it, backoff is used.
- Compression Substitutions: An FM-index or wavelet trees can reduce size severalfold versus a plain suffix array, at the cost of slower access.
- Caching: The top-k most frequent suffixes are cached in RAM (e.g., ≈200 MB for 10M suffixes).
Example: For a 1B-token corpus split into two 500M-token shards, with 4-byte pointers, pruning of rare suffixes, and a cache of the most frequent queries, the index occupies 4.8 GB and achieves 1–2 ms latency on cache hits and up to 50 ms in the worst case; next-token prediction accuracy of the mini configuration stays close to the full system's, surpassing 5-gram LMs (29%) (Liu et al., 2024).
Empirically, pruning 50% of rare context suffixes reduces next-token accuracy by ≈15 percentage points.
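The caching trade-off above can be sketched with a memoized front end over the on-disk index; `index_count` here is a hypothetical backend hook, not part of the published API:

```python
from functools import lru_cache

def make_cached_counter(index_count, max_entries=10_000_000):
    """Wrap a (slow, SSD-backed) count function with an in-RAM LRU cache,
    so hot n-grams are served at ms-level latency."""
    @lru_cache(maxsize=max_entries)
    def count(query: str) -> int:
        return index_count(query)   # cold path: hits the on-disk index
    return count
```

A repeated query then touches the backend only once; subsequent lookups are served from the cache.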
5. Case Study: Benchmark Contamination Analysis
Infini-gram mini was deployed to quantify overlap between evaluation benchmarks and Internet-scale pretraining corpora, revealing extensive “contamination” (Xu et al., 13 Jun 2025):
- Methodology: For each test entry, all length-50 substrings (stride = 1 word) are checked against the corpus. The contamination rate is the fraction of substrings with nonzero count; entries are labeled “clean,” “suspicious,” or “dirty” according to thresholds on this rate.
- Findings: SQuAD is 40.1% dirty on DCLM (2022 crawl), 2.7% on CC-2025 (2025 crawl). MMLU and ARC-Challenge contamination rates are also substantial (up to 32.6%).
- Error Analysis: Majority (58–83%) of dirty cases are exact question+answer matches. Lower rates involve paraphrasing, partial overlap, or false positives (1–3%).
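The sliding-window check can be sketched as follows; `count_in_corpus` is a hypothetical hook onto the count query described in Section 3:

```python
def contamination_rate(entry_text, count_in_corpus, span_len=50):
    """Fraction of length-`span_len` word windows (stride 1) of a benchmark
    entry that occur verbatim in the indexed corpus."""
    words = entry_text.split()
    if len(words) < span_len:
        spans = [" ".join(words)] if words else []
    else:
        spans = [" ".join(words[i:i + span_len])
                 for i in range(len(words) - span_len + 1)]
    if not spans:
        return 0.0
    hits = sum(1 for s in spans if count_in_corpus(s) > 0)
    return hits / len(spans)
```

An entry's label (“clean,” “suspicious,” “dirty”) is then assigned by thresholding this rate.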
A public “contamination bulletin” is maintained, with API access and community updates.
6. Limitations, Impact, and Future Directions
- Latency vs. Compression: FM-index random access requires decompression, pushing document-retrieval query times to seconds; this is the key trade-off versus in-memory suffix arrays (millisecond-level).
- Supported Queries: Only exact-match, case-sensitive queries are implemented; no co-occurrence, fuzzy, or semantic search is currently feasible.
- Benchmark Limitations: Contamination analysis is limited to exact string collision, not paraphrastic or tokenized overlaps.
- Scalability: Throughput scales linearly—complete indexing of 1 PB Common Crawl requires approximately 1200 node-days.
Planned improvements include prefetch- and SSD-aware scheduling to reduce latency, support for approximate/cross-shard pattern queries, subword/Unicode indexing, and ongoing community-driven analyses.
7. Relation to Infini-gram and Larger Landscape
Infini-gram mini is a direct evolution of the “Infini-gram” engine (Liu et al., 2024), which introduced unbounded-length n-gram (∞-gram) language modeling via on-the-fly, index-backed count queries instead of explicit count tables. Infini-gram mini extends these ideas by:
- Engineering highly compressed, distributed indexes (FM-index vs. plain suffix array), reducing storage from a multiple of the corpus size to a fraction of it.
- Enabling petabyte-scale exact search, supporting data analysis initiatives critical for training and evaluating neural LLMs.
The system’s architecture demonstrates that with optimized index compression and parallelization, exact pattern search and contamination auditing are tractable at web scale. Public code, APIs, and an open contamination bulletin are accessible at https://infini-gram-mini.io (Xu et al., 13 Jun 2025).