Infini-gram Search Engine is a high-performance system for exact n-gram and ∞-gram language modeling, utilizing suffix arrays and FM-index structures.
It processes trillions of tokens efficiently, supporting precise probability evaluation and rapid substring queries for text analysis and neural LM augmentation.
The architecture combines rigorous probabilistic frameworks with scalable indexing and compression techniques to enable effective data curation and anomaly detection.
Infini-gram Search Engine refers to a family of high-performance, large-scale systems for exact n-gram and ∞-gram language modeling and search, centered on the Infini-gram and Infini-gram mini engines. These systems modernize classical n-gram modeling to process up to trillions of tokens, supporting both efficient statistical language modeling (including arbitrarily unbounded n contexts) and Internet-scale exact substring search. The Infini-gram architecture builds on full-text suffix arrays, while Infini-gram mini employs FM-index–based compression and search to extend practical scale to tens of terabytes and beyond. Both engines are motivated by the need for transparent text analysis, data curation, and neural LLM augmentation at previously infeasible scale, combining rigorous probabilistic frameworks with algorithmic innovations for efficient querying, storage, and deployment.
1. Core Data Structures and Algorithms
Infini-gram: Suffix Array Engine
The Infini-gram engine represents the tokenized corpus (5 trillion tokens) as a contiguous flat byte array, with each token assigned two bytes. All documents are delimited via a designated end-of-document marker (0xff 0xff). On this structure, a full suffix array (SA) is constructed—a dense array of length N mapping each lexicographically ordered suffix to its byte offset within the corpus. The total storage is $7$ bytes per token: $2$ for the token array and $5$ for the suffix array pointer, resulting in a $35$ TB index for $5$ trillion tokens. Construction is parallelized: e.g., $1.4$ trillion tokens take ∼48 hours on a $128$-CPU, $1$ TiB RAM cluster, while the full $5$T tokens index in ∼2 days with ∼10 TB SSD (Liu et al., 2024).
Given an n-gram $Q = x_1, \ldots, x_L$, retrieval is performed by binary search in the suffix array to identify the interval $[\ell, r)$; query time is $O(L + \log N)$, and observed latency is dominated by $\sim \log N$ random-access disk operations, with count queries up to $|Q| = 1000$ feasible in $20$ ms even on $1.4$T tokens.
Infini-gram mini: FM-index Engine
Infini-gram mini compresses and indexes massive text collections using the FM-index, derived from the Burrows–Wheeler Transform (BWT) and sampled suffix/inverse suffix arrays. The input corpus T (bytes of length n) is concatenated with unique separators, enabling document boundaries. The FM-index comprises:
Sampled Suffix Array (SA/ISA): Stores only every a-th entry to reduce space; missing locations are recovered by iterating the LF-mapping.
BWT with Huffman-Shaped Wavelet Tree: Stores the permuted text, compressed to nH0+2σlogn bits, where H0 is zeroth-order entropy (about $2.1$ bits/symbol).
Operations: Primitive operations are find(Q) (returns the SA interval for pattern $Q$ in $O(|Q| H_0)$ time), locate(i) (recovers the text location of a suffix via $O(a H_0)$ LF-mapping steps), and reconstruct(p, d) (recovers the substring of length $d$ at position $p$).
Index files occupy only $0.44\times$ the corpus size. For $46$ TB of text, indexing completes in $50$ days on a single node, or $19$ hours using $75$ nodes in parallel; the resulting index is $20.1$ TB (Xu et al., 13 Jun 2025).

| Engine | Index Structure | Storage Overhead | Search Latency |
|---|---|---|---|
| Infini-gram | Full suffix array | 7 bytes/token | ms (length-dependent) |
| Infini-gram mini | FM-index (BWT + sampled SA) | $0.44\times$ corpus | s (text-length-dependent) |

2. Probability Models and Query Semantics
Classical and ∞-gram Backoff
Traditional n-gram backoff models (e.g., Katz) interpolate between a fixed-$n$ model and lower-order models. The ∞-gram model, enabled by the Infini-gram engine, defines the probability as:

$$P_\infty(w_i \mid w_{1 \ldots i-1}) = \frac{\mathrm{cnt}(w_{i-n+1 \ldots i-1} \circ w_i)}{\mathrm{cnt}(w_{i-n+1 \ldots i-1})}$$

where $n$ is maximal such that $\mathrm{cnt}(w_{i-n'+1 \ldots i-1}) > 0$ for $n' \le i$. Backoff is invoked only when the context is unseen; smoothing weights $\alpha$ are unnecessary, and the distribution is immediately normalized: $\sum_w P_\infty(w \mid h) = 1$. For suffix array–based engines, this is computed exactly via interval counts; for FM-index structures, the identical logic is applied using recursive pattern finding over compressed substrings.
Query API and Interfaces
Infini-gram exposes a REST JSON API, supporting pointwise probability evaluation as well as full next-token distributions; example request for an ∞-gram probability:
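A minimal request sketch follows; the endpoint URL, index name, and field names reflect the public infini-gram API but are assumptions here and should be checked against the current API documentation:

```python
import json
from urllib import request

# Illustrative request body for an ∞-gram probability query: the probability
# of the final token of `query` given the preceding tokens as context.
# Index and field names are assumptions, not verified against the live API.
payload = {
    "index": "v4_rpj_llama_s4",
    "query_type": "infgram_prob",
    "query": "natural language processing",
}

def post(url="https://api.infini-gram.io/"):
    # Sends the request; the JSON response carries the raw counts, the
    # normalized probability, and the effective context length used.
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling `post()` requires network access; the payload alone shows the shape of a pointwise probability query.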
The response includes counts, the normalized probability, and the "effective n" (context length used). Batch distribution queries return top-$K$ next-token probabilities, with latency for ∞-gram distributions $\lesssim 200$ ms on large-scale index shards.
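These probabilities can be combined with a neural LM for augmentation via linear interpolation of next-token distributions; a minimal sketch, where the mixing weight `lam` and the effective-n gate are illustrative choices rather than the published interpolation scheme:

```python
def interpolate(p_neural, p_infgram, lam=0.5, effective_n=1, min_n=2):
    # Mix two next-token distributions: P = lam * P_neural + (1 - lam) * P_inf.
    # Gate on the ∞-gram's effective context length: a very short matched
    # context carries little signal, so fall back to the neural LM alone.
    if effective_n < min_n:
        return dict(p_neural)
    vocab = set(p_neural) | set(p_infgram)
    return {
        w: lam * p_neural.get(w, 0.0) + (1 - lam) * p_infgram.get(w, 0.0)
        for w in vocab
    }

p_neural = {"mat": 0.6, "hat": 0.4}
p_inf = {"mat": 1.0}
print(interpolate(p_neural, p_inf, effective_n=5))  # mat ≈ 0.8, hat ≈ 0.2
```

If both inputs are normalized, the mixture is normalized as well, which is why no extra smoothing weights are needed on top of the ∞-gram estimate.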
3. System Architecture and Engineering
Infini-gram
Sharding and Parallelism: Both indexing and querying are performed in parallel, utilizing sharded suffix arrays across multiple disks and compute nodes for I/O efficiency.
Disk Layout: Token and suffix arrays reside on SSDs; queries are memory-mapped, minimizing in-RAM requirements.
Latency: On $1.4$T tokens with 8-file shards, exemplary latencies are: count (n-gram): 18 ms; 5-gram distribution: 39 ms; ∞-gram probability (single token): 135 ms; ∞-gram distribution: 180 ms.
Infini-gram mini
Petabyte-Scale Design: Text is sharded into ~600–700 GB chunks. Construction is split into five stages: (1) SA + BWT; (2) symbol counting; (3) wavelet tree construction; (4) SA sampling; (5) ISA sampling.
Engineering Optimizations: In-place streaming and multithreaded algorithms deliver an $18\times$ speedup in construction and a $3.2\times$ reduction in peak RAM. At inference time, indices are loaded memory-mapped and read-only, keeping RAM use under 2 GB.
Distributed Query: On GCP SSDs, a single-token count ($|Q| = 1$) takes 4–32 ms on 1–9 TB shards, but scales to several seconds for $|Q| = 1000$ due to I/O for scattered compressed substrings.
4. Empirical Performance and Comparative Evaluation
Language Modeling
Infini-gram (∞-gram LM): On held-out text, a 5-gram LM achieves $29\%$ next-token accuracy; the ∞-gram achieves $47\%$ overall, rising above $75\%$ for contexts of $n \ge 16$ and $80\%$ in sparse cases.
Neural LM Augmentation: Interpolating the LLaMA-2 13B neural LM with the ∞-gram reduces perplexity from 5.30 to 4.41 ($-21\%$); for LLaMA-2 70B, from 4.59 to 3.96 on 1.8T-token data, demonstrating nonparametric benefits (Liu et al., 2024).
Benchmark and Corpus Analysis
Infini-gram mini Contamination Analysis: The benchmark contamination metric $\eta$ (the fraction of overlapping 50-character substrings found in the training corpus) reveals "dirty" rates as high as $27.7\%$ (MMLU), $32.6\%$ (ARC-Challenge), and $40.1\%$ (SQuAD) on corpora up to $16.7$ TB. Most "dirty" entries are exact full Q&A matches (up to $83\%$).
Query Throughput: On medium shard sizes (~1–9 TB), count queries for $|Q| = 1$–$1000$ take 4 ms–25 s; document retrieval for spans up to 3000 bytes takes 0.43–4.46 s (Xu et al., 13 Jun 2025).
Comparison to Prior Systems

| System | Indexable Tokens (approx.) | Storage per Token or Corpus | n Supported | Query Latency |
|---|---|---|---|---|
| Google Books 5-gram | $5 \times 10^{11}$ | 24 GB (5-grams only) | $n = 5$ | — |
| Suffix-tree LM | $9 \times 10^{9}$ | 63 GB RAM | $n$ unbounded | — |
| Nearest-neighbor LM | $2.8 \times 10^{10}$ | 432 TB (vector index) | limited | — |
| Infini-gram | $5 \times 10^{12}$ | 35 TB | $n = \infty$ | ms |
| Infini-gram mini | $4.6 \times 10^{13}$ | 20.1 TB ($0.44\times$) | $n = \infty$ | s |

5. Applications
Large-scale Text Analysis: Enables corpus inspection, such as quantifying n-gram frequencies and identifying rare sequences.
Data Curation and Decontamination: Powers tools like SearchDoc for complex Boolean (CNF) queries over n-grams, supporting removal of toxic or sensitive material.
Neural LM Augmentation: The ∞-gram model serves as a nonparametric memory, lowering neural model perplexity without requiring GPU access.
Anomaly Detection: Agreement-curve analysis between LMs and ∞-gram models uncovers memorization and positional-embedding artifacts in the output of transformer-based models.
6. Limitations
Latency: Infini-gram (suffix array) supports millisecond latency; Infini-gram mini (FM-index) has seconds-level query time for long substrings or document retrieval due to decompression and scattered storage access.
Query Expressivity: Only exact, case-sensitive byte matching is supported; no semantic, fuzzy, or edit-distance–tolerant search.
Co-occurrence and Boolean Matching: Multi-pattern and co-occurrence search are inefficient, requiring serial location recovery in the suffix array.
Scalability: Petabyte-scale indexing is feasible, but the attendant increases in random I/O and network synchronization demand further system optimization.
Prospective Improvements
Disk-page prefetching and batching of LF/rank operations to hide I/O latency.
Support for multi-pattern and Boolean queries via auxiliary compressed posting lists.
Enhanced approximate/fuzzy search using generalized suffix automata or edit-distance–aware FM-index extensions.
Hardware offloading (GPU/FPGA) for intensive LF-mapping phases.
7. Context and Significance in NLP
Infini-gram and its successor Infini-gram mini mark a departure from traditional n-gram precomputation or neural-only retrieval by enabling exact, arbitrarily long pattern matching and corpus-scale likelihood estimation. This supports both theoretical investigation—such as revealing deficiencies in transformer positional encodings and benchmark contamination—and practical operations, including corpus curation and LLM debugging. The architectural shift to suffix and FM-index–based designs establishes new performance and scalability frontiers for nonparametric search engines and transparent statistical language modeling (Liu et al., 2024, Xu et al., 13 Jun 2025).