
Infini-gram Search Engine

Updated 18 January 2026
  • Infini-gram Search Engine is a high-performance system for exact n-gram and ∞-gram language modeling, utilizing suffix arrays and FM-index structures.
  • It processes trillions of tokens efficiently, supporting precise probability evaluation and rapid substring queries for text analysis and neural LM augmentation.
  • The architecture combines rigorous probabilistic frameworks with scalable indexing and compression techniques to enable effective data curation and anomaly detection.

Infini-gram Search Engine refers to a family of high-performance, large-scale systems for exact n-gram and ∞-gram language modeling and search, centered on the Infini-gram and Infini-gram mini engines. These systems modernize classical n-gram modeling to process up to trillions of tokens, supporting both efficient statistical language modeling (including arbitrarily long contexts, i.e., unbounded n) and Internet-scale exact substring search. The Infini-gram architecture builds on full-text suffix arrays, while Infini-gram mini employs FM-index–based compression and search to extend practical scale to tens of terabytes and beyond. Both engines are motivated by the need for transparent text analysis, data curation, and neural LM augmentation at previously infeasible scale, combining rigorous probabilistic frameworks with algorithmic innovations for efficient querying, storage, and deployment.

1. Core Data Structures and Algorithms

Infini-gram: Suffix Array Engine

The Infini-gram engine represents the tokenized corpus (5 trillion tokens) as a contiguous flat byte array, with each token assigned two bytes. All documents are delimited via a designated end-of-document marker (0xff 0xff). On this structure, a full suffix array (SA) is constructed: a dense array of length N mapping each lexicographically ordered suffix to its byte offset within the corpus. The total storage is 7 bytes per token (2 for the token array and 5 for the suffix-array pointer), resulting in a 35 TB index for 5 trillion tokens. Construction is parallelized: e.g., 1.4 trillion tokens take ~48 hours on a 128-CPU, 1 TiB RAM cluster, while the full 5T tokens index in ~2 days with ~10 TB SSD (Liu et al., 2024).
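As a quick sanity check, the per-token layout above reproduces the quoted index size; the constants come directly from the text, while the function name is illustrative:

```python
# Back-of-envelope check of the storage layout described above:
# 2 bytes per token for the corpus array plus a 5-byte suffix-array
# pointer per token, i.e. 7 bytes/token in total.
TOKEN_BYTES = 2       # token array entry
SA_POINTER_BYTES = 5  # suffix-array pointer

def index_size_tb(n_tokens: float) -> float:
    """Total index size in TB (10^12 bytes) for a corpus of n_tokens."""
    return n_tokens * (TOKEN_BYTES + SA_POINTER_BYTES) / 1e12

print(index_size_tb(5e12))  # 5 trillion tokens -> 35.0 TB, matching the text
```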

Given an n-gram Q = x_1, ..., x_L, retrieval is performed by binary search in the suffix array to identify the interval [ℓ, r); query time is O(L + log N), and observed latency is dominated by ~log N random-access disk operations, with count queries up to |Q| = 1000 feasible in 20 ms even on 1.4T tokens.

Infini-gram mini: FM-index Engine

Infini-gram mini compresses and indexes massive text collections using the FM-index, derived from the Burrows–Wheeler Transform (BWT) and sampled suffix/inverse suffix arrays. The input corpus T (a byte string of length n) is concatenated with unique separators, preserving document boundaries. The FM-index comprises:

  • Sampled Suffix Array (SA/ISA): Stores only every a-th entry to reduce space; missing locations are recovered by iterating the LF-mapping.
  • BWT with Huffman-Shaped Wavelet Tree: Stores the permuted text, compressed to n·H_0 + 2σ·log n bits, where H_0 is the zeroth-order entropy (about 2.1 bits/symbol).
  • Operations: Primitive operations are find(Q) (the SA interval for pattern Q, in O(|Q|·H_0) time), locate(i) (the text location of a suffix, requiring O(a·H_0) LF steps), and reconstruct(p, d) (recover the substring of length d at position p).

Index files occupy only 0.44× the corpus size. For 46 TB of text, indexing completes in 50 days on a single node or 19 hours using 75 nodes in parallel; the resulting index is 20.1 TB (Xu et al., 13 Jun 2025).

| Engine | Index Structure | Storage Overhead | Search Latency |
|---|---|---|---|
| Infini-gram | Full Suffix Array | 7 bytes/token | ms (length-dependent) |
| Infini-gram mini | FM-index (BWT + SA) | 0.44× corpus | s (text-length-dependent) |

2. Probability Models and Query Semantics

Classical and ∞-gram Backoff

Traditional n-gram backoff models (e.g., Katz) interpolate between fixed-n and lower-order models. The ∞-gram model, enabled by the Infini-gram engine, defines probability as:

P_∞(w_i | w_{1..i-1}) = cnt(w_{i-n+1..i-1} ∘ w_i) / cnt(w_{i-n+1..i-1})

where n is maximal such that cnt(w_{i-n'+1..i-1}) > 0 for n' ≤ i. Backoff is invoked only when the context is unseen; smoothing weights α are unnecessary, ensuring immediate normalization: Σ_w P_∞(w | h) = 1. For suffix array–based engines, this is computed exactly via interval counts; for FM-index structures, the identical logic is applied using recursive pattern finding over compressed substrings.

Query API and Interfaces

Infini-gram exposes a REST JSON API, supporting pointwise probability evaluation as well as full next-token distributions; example request for an ∞-gram probability:
    POST /api/query
    {
      "type": "infgram_prob",
      "context_ids": [14, 502, 17, ... ],
      "token_id": 137
    }
    </p> <p>Response includes counts, normalized probability, and &quot;effective n&quot; (context length used). Batch distribution queries return top-$Knexttokenprobabilities,withlatencyfor next-token probabilities, with latency for \inftygramdistributions-gram distributions \lesssim$ 200 ms on large-scale index shards.

    3. System Architecture and Engineering

    Infini-gram

    • Sharding and Parallelism: Both indexing and querying are performed in parallel, utilizing sharded suffix arrays across multiple disks and compute nodes for I/O efficiency.
    • Disk Layout: Token and suffix arrays reside on SSDs; queries are memory-mapped, minimizing in-RAM requirements.
    • Latency: On 1.4T tokens and 8 file shards, exemplary latencies are: count(n-gram), 18 ms; 5-gram distribution, 39 ms; ∞-gram probability (single token), 135 ms; ∞-gram distribution, 180 ms.

    Infini-gram mini

    • Petabyte-Scale Design: Text is sharded into ~600–700 GB chunks. Construction is split into: (1) SA + BWT; (2) symbol counting; (3) wavelet tree construction; (4) SA sampling; (5) ISA sampling.
    • Engineering Optimizations: In-place streaming and multithreaded algorithms deliver an 18× speedup in construction and a 3.2× reduction in peak RAM. At inference time, indices are loaded in a memory-mapped, read-only fashion, ensuring sub-2 GB RAM use.
    • Distributed Query: Query performance (on GCP SSDs): a single-token count (|Q| = 1) takes 432 ms on 19 TB shards, but scales to several seconds for |Q| = 1000 due to I/O for scattered compressed substrings.

    4. Empirical Performance and Comparative Evaluation

    Language Modeling

    • Infini-gram (∞-gram LM): On held-out text, the 5-gram LM achieves 29% next-token accuracy; the ∞-gram achieves 47% overall, rising above 75% for context n ≥ 16 and 80% in sparse cases.
    • Neural LM Augmentation: Interpolating the LLaMA-2 13B neural LM with the ∞-gram reduces perplexity from 5.30 to 4.41 (-21%); for LLaMA-2 70B, from 4.59 to 3.96 on 1.8T-token data, demonstrating nonparametric benefits (Liu et al., 2024).

    Benchmark and Corpus Analysis

    • Infini-gram mini Contamination Analysis: The benchmark contamination metric η (the fraction of overlapping 50-character substrings found in the training corpus) reveals "dirty" rates as high as 27.7% (MMLU), 32.6% (ARC-Challenge), and 40.1% (SQuAD) on corpora up to 16.7 TB. Most "dirty" entries are full Q&A exact matches (up to 83%).
    • Query Throughput: On medium shard sizes (~19 TB), count queries for |Q| = 1–1000 take 4 ms–25 s; document retrieval for spans up to 3000 bytes takes 0.43–4.46 s (Xu et al., 13 Jun 2025).

    Comparison to Prior Systems

    | System | Indexable Tokens (approx.) | Storage per Token or Corpus | n Supported | Query Latency |
    |---|---|---|---|---|
    | Google Books 5-gram | 5 × 10^11 | 24 GB for 5-gram only | n = 5 | — |
    | Suffix-tree LM | 9 × 10^9 | 63 GB RAM | n unbounded | — |
    | Nearest-neighbor LM | 2.8 × 10^10 | 432 TB (vector index) | limited | — |
    | Infini-gram | 5 × 10^12 | 35 TB | n = ∞ | ms |
    | Infini-gram mini | 4.6 × 10^13 | 20.1 TB (0.44×) | n = ∞ | s |

    5. Applications

    • Large-scale Text Analysis: Enables corpus inspection, such as quantifying n-gram frequencies and identifying rare sequences.
    • Data Curation and Decontamination: Powers tools like SearchDoc for complex Boolean (CNF) queries over n-grams, supporting removal of toxic or sensitive material.
    • Neural LM Augmentation: The ∞-gram model serves as a nonparametric memory, lowering neural-model perplexity without GPU access.
    • Anomaly Detection: Agreement-curve analysis between LMs and ∞-gram models uncovers memorization and positional-embedding artifacts in output from transformer-based models.
    • Web Interface and Programmatic API: Offers interactive and programmatic access for both general substring querying and benchmarking contamination rates (see https://infini-gram-mini.io/demo and https://api.infini-gram-mini.io) (Xu et al., 13 Jun 2025).
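The decontamination use case rests on the η metric described earlier: the fraction of a benchmark entry's 50-character substrings found verbatim in the training corpus. A minimal sketch, with made-up toy strings and Python's `in` operator standing in for the engine's FM-index membership test:

```python
# Illustrative computation of the contamination metric η: the fraction of
# an entry's length-50 substrings that occur verbatim in the corpus.
WINDOW = 50

def contamination_eta(entry, corpus, window=WINDOW):
    """Fraction of length-`window` substrings of `entry` found in `corpus`."""
    spans = [entry[i:i + window] for i in range(len(entry) - window + 1)]
    if not spans:
        return 0.0  # entry shorter than the window
    return sum(1 for s in spans if s in corpus) / len(spans)

leaked = "What is the capital of France? Answer: Paris. " * 3  # toy Q&A entry
corpus = "lots of crawled text ... " + leaked + " ... more text"
print(contamination_eta(leaked, corpus))    # 1.0: fully contaminated
print(contamination_eta("x" * 80, corpus))  # 0.0: no overlap at all
```

At engine scale, each `s in corpus` membership test becomes a single count query against the index rather than a linear scan.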

    6. Limitations and Future Directions

    Identified Limitations

    • Latency: Infini-gram (suffix array) supports millisecond latency; Infini-gram mini (FM-index) has seconds-level query time for long substrings or document retrieval due to decompression and scattered storage access.
    • Query Expressivity: Only exact, case-sensitive byte matching is supported; no semantic, fuzzy, or edit-distance–tolerant search.
    • Co-occurrence and Boolean Matching: Multi-pattern and co-occurrence search are inefficient, requiring serial location recovery in the suffix array.
    • Scalability: Petabyte-scale indexing is feasible, but the accompanying growth in random I/O and network synchronization demands further system optimization.
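A toy FM-index makes the latency asymmetry behind these limitations concrete: find() is a single backward scan over the pattern, but every occurrence must then walk the LF-mapping back to a sampled suffix-array entry, which is what pushes locate-heavy queries into seconds. This is a minimal sketch; the toy text and sample rate are illustrative, and plain list scans stand in for the wavelet-tree rank of a real engine:

```python
# Toy FM-index: backward search (find) plus the LF-walk that makes
# locate() cost O(a * H0) steps per occurrence.
def build_fm(text, sample_rate=4):
    text += "\0"                                    # unique terminator
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]                 # Burrows-Wheeler transform
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0                                # C[ch] = # chars < ch
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # Sampled SA: keep only entries whose text position is a multiple of rate.
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt, C, samples

def rank(bwt, ch, i):
    """Occurrences of ch in bwt[:i] (O(1) via wavelet tree; O(n) scan here)."""
    return sum(1 for c in bwt[:i] if c == ch)

def find(bwt, C, pattern):
    """Backward search: half-open SA interval of suffixes starting with pattern."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + rank(bwt, ch, lo)
        hi = C[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi

def locate(bwt, C, samples, row):
    """Recover the text position for one SA row by LF steps to a sample."""
    steps = 0
    while row not in samples:
        ch = bwt[row]
        row = C[ch] + rank(bwt, ch, row)            # one LF step
        steps += 1
    return samples[row] + steps

bwt, C, samples = build_fm("abracadabra")
lo, hi = find(bwt, C, "abra")
print(hi - lo)                                      # 2 occurrences of "abra"
print([locate(bwt, C, samples, r) for r in range(lo, hi)])  # [7, 0]
```

Raising `sample_rate` shrinks the sampled SA but lengthens each locate() walk, the same space-versus-latency trade-off the engine faces at scale.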

    Prospective Improvements

    • Disk-page prefetching and batching of LF/rank operations to hide I/O latency.
    • Hybrid “hot index” caches for common substrings.
    • Support for multi-pattern and Boolean queries via auxiliary compressed posting lists.
    • Enhanced approximate/fuzzy search using generalized suffix automata or edit-distance–aware FM-index extensions.
    • Hardware offloading (GPU/FPGA) for intensive LF-mapping phases.

    7. Context and Significance in NLP

    Infini-gram and its successor Infini-gram mini mark a departure from traditional n-gram precomputation or neural-only retrieval by enabling exact, arbitrarily long pattern matching and corpus-scale likelihood estimation. This supports both theoretical investigation, such as revealing deficiencies in transformer positional encodings and benchmark contamination, and practical operations, including corpus curation and LLM debugging. The architectural shift to suffix-array and FM-index–based designs establishes new performance and scalability frontiers for nonparametric search engines and transparent statistical language modeling (Liu et al., 2024; Xu et al., 13 Jun 2025).

    References

    • Liu et al. (2024). arXiv:2401.17377.
    • Xu et al. (13 Jun 2025). arXiv:2506.12229.
