Infini-gram Search Engine is a high-performance system for exact n-gram and ∞-gram language modeling, utilizing suffix arrays and FM-index structures.
It processes trillions of tokens efficiently, supporting precise probability evaluation and rapid substring queries for text analysis and neural LM augmentation.
The architecture combines rigorous probabilistic frameworks with scalable indexing and compression techniques to enable effective data curation and anomaly detection.
Infini-gram Search Engine refers to a family of high-performance, large-scale systems for exact n-gram and ∞-gram language modeling and search, centered on the Infini-gram and Infini-gram mini engines. These systems modernize classical n-gram modeling to process up to trillions of tokens, supporting both efficient statistical language modeling (including arbitrarily unbounded n contexts) and Internet-scale exact substring search. The Infini-gram architecture builds on full-text suffix arrays, while Infini-gram mini employs FM-index–based compression and search to extend practical scale to tens of terabytes and beyond. Both engines are motivated by the need for transparent text analysis, data curation, and neural LLM augmentation at previously infeasible scale, combining rigorous probabilistic frameworks with algorithmic innovations for efficient querying, storage, and deployment.
1. Core Data Structures and Algorithms
Infini-gram: Suffix Array Engine
The Infini-gram engine represents the tokenized corpus (5 trillion tokens) as a contiguous flat byte array, with each token assigned two bytes. All documents are delimited via a designated end-of-document marker (0xff 0xff). On this structure, a full suffix array (SA) is constructed—a dense array of length N mapping each lexicographically ordered suffix to its byte offset within the corpus. The total storage is $7$ bytes per token: $2$ for the token array and $5$ for the suffix array pointer, resulting in a $35$ TB index for $5$ trillion tokens. Construction is parallelized: e.g., $1.4$ trillion tokens take ∼48 hours on a $128$-CPU, $1$ TiB RAM cluster, while the full $5$T tokens index in ∼2 days with ∼10 TB SSD (Liu et al., 2024).
Given an n-gram $Q = x_1, \ldots, x_L$, retrieval is performed by binary search in the suffix array to identify the interval $[\ell, r)$; query time is $O(L + \log N)$, and observed latency is dominated by $\sim \log N$ random-access disk operations, with count queries up to $|Q| = 1000$ feasible in $20$ ms even on $1.4$T tokens.
Infini-gram mini: FM-index Engine
Infini-gram mini compresses and indexes massive text collections using the FM-index, derived from the Burrows–Wheeler Transform (BWT) and sampled suffix/inverse suffix arrays. The input corpus T (bytes of length n) is concatenated with unique separators, enabling document boundaries. The FM-index comprises:
Sampled Suffix Array (SA/ISA): Stores only every a-th entry to reduce space; missing locations are recovered by iterating the LF-mapping.
BWT with Huffman-Shaped Wavelet Tree: Stores the permuted text, compressed to nH0+2σlogn bits, where H0 is zeroth-order entropy (about $2.1$ bits/symbol).
Operations: Primitive operations are find(Q) (returns the SA interval for pattern $Q$ in $O(|Q| H_0)$ time), locate(i) (recovers the text location of a suffix via $O(a H_0)$ LF-mapping steps), and reconstruct(p, d) (recovers the substring of length $d$ at position $p$).
Index files occupy only $0.44\times$ the corpus size. For $46$ TB of text, indexing completes in $50$ days on a single node, or $19$ hours using $75$ nodes in parallel; the resulting index is $20.1$ TB (Xu et al., 13 Jun 2025).

| Engine | Index Structure | Storage Overhead | Search Latency |
|---|---|---|---|
| Infini-gram | Full suffix array | 7 bytes/token | ms (length-dependent) |
| Infini-gram mini | FM-index (BWT + sampled SA) | $0.44\times$ corpus | s (text-length-dependent) |

2. Probability Models and Query Semantics
Classical and ∞-gram Backoff
Traditional n-gram backoff models (e.g., Katz) interpolate between a fixed-$n$ model and lower-order models. The ∞-gram model, enabled by the Infini-gram engine, defines the probability as:

$$P_\infty(w_i \mid w_{1 \ldots i-1}) = \frac{\mathrm{cnt}(w_{i-n+1 \ldots i-1} \circ w_i)}{\mathrm{cnt}(w_{i-n+1 \ldots i-1})}$$

where $n$ is maximal such that $\mathrm{cnt}(w_{i-n'+1 \ldots i-1}) > 0$ for $n' \le i$. Backoff is invoked only when the context is unseen; smoothing weights $\alpha$ are unnecessary, and the distribution is immediately normalized: $\sum_w P_\infty(w \mid h) = 1$. For suffix array–based engines, this is computed exactly via interval counts; for FM-index structures, the identical logic is applied using recursive pattern finding over compressed substrings.
Query API and Interfaces
Infini-gram exposes a REST JSON API, supporting pointwise probability evaluation as well as full next-token distributions; example request for an ∞-gram probability:
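A minimal request sketch follows; the endpoint URL, index name, and field names reflect the public infini-gram API but are assumptions here and should be checked against the current API documentation:

```python
import json
from urllib import request

# Illustrative request body for an ∞-gram probability query: the probability
# of the final token of `query` given the preceding tokens as context.
# Index and field names are assumptions, not verified against the live API.
payload = {
    "index": "v4_rpj_llama_s4",
    "query_type": "infgram_prob",
    "query": "natural language processing",
}

def post(url="https://api.infini-gram.io/"):
    # Sends the request; the JSON response carries the raw counts, the
    # normalized probability, and the effective context length used.
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling `post()` requires network access; the payload alone shows the shape of a pointwise probability query.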
The response includes counts, the normalized probability, and the "effective n" (context length used). Batch distribution queries return top-$K$ next-token probabilities, with latency for ∞-gram distributions $\lesssim 200$ ms on large-scale index shards.
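These probabilities can be combined with a neural LM for augmentation via linear interpolation of next-token distributions; a minimal sketch, where the mixing weight `lam` and the effective-n gate are illustrative choices rather than the published interpolation scheme:

```python
def interpolate(p_neural, p_infgram, lam=0.5, effective_n=1, min_n=2):
    # Mix two next-token distributions: P = lam * P_neural + (1 - lam) * P_inf.
    # Gate on the ∞-gram's effective context length: a very short matched
    # context carries little signal, so fall back to the neural LM alone.
    if effective_n < min_n:
        return dict(p_neural)
    vocab = set(p_neural) | set(p_infgram)
    return {
        w: lam * p_neural.get(w, 0.0) + (1 - lam) * p_infgram.get(w, 0.0)
        for w in vocab
    }

p_neural = {"mat": 0.6, "hat": 0.4}
p_inf = {"mat": 1.0}
print(interpolate(p_neural, p_inf, effective_n=5))  # mat ≈ 0.8, hat ≈ 0.2
```

If both inputs are normalized, the mixture is normalized as well, which is why no extra smoothing weights are needed on top of the ∞-gram estimate.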
3. System Architecture and Engineering
Infini-gram
Sharding and Parallelism: Both indexing and querying are performed in parallel, utilizing sharded suffix arrays across multiple disks and compute nodes for I/O efficiency.
Disk Layout: Token and suffix arrays reside on SSDs; queries are memory-mapped, minimizing in-RAM requirements.
Latency: On $1.4$T tokens with 8-file shards, exemplary latencies are: count (n-gram): 18 ms; 5-gram distribution: 39 ms; ∞-gram probability (single token): 135 ms; ∞-gram distribution: 180 ms.
Infini-gram mini
Petabyte-Scale Design: Text is sharded into ~600–700 GB chunks. Construction is split into five stages: (1) SA + BWT; (2) symbol counting; (3) wavelet tree construction; (4) SA sampling; (5) ISA sampling.
Engineering Optimizations: In-place streaming and multithreaded algorithms deliver an $18\times$ speedup in construction and a $3.2\times$ reduction in peak RAM. At inference time, indices are loaded memory-mapped and read-only, keeping RAM use under 2 GB.
Distributed Query: On GCP SSDs, a single-token count ($|Q| = 1$) takes 4–32 ms on 1–9 TB shards, but scales to several seconds for $|Q| = 1000$ due to I/O for scattered compressed substrings.
4. Empirical Performance and Comparative Evaluation
Language Modeling
Infini-gram (∞-gram LM): On held-out text, a 5-gram LM achieves $29\%$ next-token accuracy; the ∞-gram achieves $47\%$ overall, rising above $75\%$ for contexts of $n \ge 16$ and $80\%$ in sparse cases.
Neural LM Augmentation: Interpolating the LLaMA-2 13B neural LM with the ∞-gram reduces perplexity from 5.30 to 4.41 ($-21\%$); for LLaMA-2 70B, from 4.59 to 3.96 on 1.8T-token data, demonstrating nonparametric benefits (Liu et al., 2024).
Benchmark and Corpus Analysis
Infini-gram mini Contamination Analysis: The benchmark contamination metric $\eta$ (the fraction of overlapping 50-character substrings found in the training corpus) reveals "dirty" rates as high as $27.7\%$ (MMLU), $32.6\%$ (ARC-Challenge), and $40.1\%$ (SQuAD) on corpora up to $16.7$ TB. Most "dirty" entries are exact full Q&A matches (up to $83\%$).
Query Throughput: On medium shard sizes (~1–9 TB), count queries for $|Q| = 1$–$1000$ take 4 ms–25 s; document retrieval for spans up to 3000 bytes takes 0.43–4.46 s (Xu et al., 13 Jun 2025).
Comparison to Prior Systems

| System | Indexable Tokens (approx.) | Storage per Token or Corpus | n Supported | Query Latency |
|---|---|---|---|---|
| Google Books 5-gram | $5 \times 10^{11}$ | 24 GB (5-grams only) | $n = 5$ | — |
| Suffix-tree LM | $9 \times 10^{9}$ | 63 GB RAM | $n$ unbounded | — |
| Nearest-neighbor LM | $2.8 \times 10^{10}$ | 432 TB (vector index) | limited | — |
| Infini-gram | $5 \times 10^{12}$ | 35 TB | $n = \infty$ | ms |
| Infini-gram mini | $4.6 \times 10^{13}$ | 20.1 TB ($0.44\times$) | $n = \infty$ | s |

5. Applications
Large-scale Text Analysis: Enables corpus inspection, such as quantifying n-gram frequencies and identifying rare sequences.
Data Curation and Decontamination: Powers tools like SearchDoc for complex Boolean (CNF) queries over n-grams, supporting removal of toxic or sensitive material.
Neural LM Augmentation: The ∞-gram model serves as a nonparametric memory, lowering neural model perplexity without requiring GPU access.
Anomaly Detection: Agreement-curve analysis between LMs and ∞-gram models uncovers memorization and positional-embedding artifacts in the output of transformer-based models.
6. Limitations
Latency: Infini-gram (suffix array) supports millisecond latency; Infini-gram mini (FM-index) has seconds-level query time for long substrings or document retrieval due to decompression and scattered storage access.
Query Expressivity: Only exact, case-sensitive byte matching is supported; no semantic, fuzzy, or edit-distance–tolerant search.
Co-occurrence and Boolean Matching: Multi-pattern and co-occurrence search are inefficient, requiring serial location recovery in the suffix array.
Scalability: Petabyte-scale indexing is feasible, but the attendant increases in random I/O and network synchronization demand further system optimization.
Prospective Improvements
Disk-page prefetching and batching of LF/rank operations to hide I/O latency.
Support for multi-pattern and Boolean queries via auxiliary compressed posting lists.
Enhanced approximate/fuzzy search using generalized suffix automata or edit-distance–aware FM-index extensions.
Hardware offloading (GPU/FPGA) for intensive LF-mapping phases.
7. Context and Significance in NLP
Infini-gram and its successor Infini-gram mini mark a departure from traditional n-gram precomputation or neural-only retrieval by enabling exact, arbitrarily long pattern matching and corpus-scale likelihood estimation. This supports both theoretical investigation—such as revealing deficiencies in transformer positional encodings and benchmark contamination—and practical operations, including corpus curation and LLM debugging. The architectural shift to suffix and FM-index–based designs establishes new performance and scalability frontiers for nonparametric search engines and transparent statistical language modeling (Liu et al., 2024, Xu et al., 13 Jun 2025).