
Disk-Aware Staged Suffix Arrays

Updated 16 February 2026
  • Disk-Aware Staged Suffix Arrays are two-tier data structures that split search between a compact in-memory index and a compressed on-disk store to minimize random-access I/O.
  • They use a sampling vector or condensed BWT for rapid in-memory bracketing, followed by efficient sequential disk block inspection to confirm pattern matches.
  • Empirical evaluations show up to 33× speedup over traditional methods, making them essential for scalable, soft query matching on terabyte-scale corpora.

A disk-aware staged suffix array is a two-level (RAM and disk) data structure for scalable pattern search on massive corpora. It refines the classical suffix array by staging its components to minimize random-access I/O on disk, enabling millisecond search latency over terabyte- to trillion-token text collections with RAM footprints far below the corpus size. This approach is central to algorithms such as SoftMatcha 2 and RoSA, which leverage it to enable flexible matching—including semantic or “soft” queries—at scales where traditional FM-index or uniform-sampled suffix arrays are impractical (Yoneda et al., 11 Feb 2026, Gog et al., 2013).

1. Structural Overview

Disk-aware staged suffix arrays realize a separation between:

  • A compact in-memory index that rapidly restricts the locus of possible matches.
  • A compressed on-disk store of the bulk of suffix-array data, organized for efficient block-oriented access.

Key Components

In-Memory Component

  • Stores a sparse “sampling” array (SoftMatcha 2) or a condensed Burrows-Wheeler Transform (BWT) index (RoSA).
  • Maps patterns to block-level regions of the suffix array with logarithmic or near-linear time in the RAM footprint.
  • Designed to fit within a small fraction of the corpus size, often 1–3% in practice.

On-Disk Component

  • Holds the main lexicographically sorted array of pointers to corpus positions (suffix array) in compressed blocks.
  • Block boundaries are determined either by uniform sampling (every B’ entries) (Yoneda et al., 11 Feb 2026) or by variable-length prefix-based partitioning (RoSA) (Gog et al., 2013).
  • Run-length encoding and block reductions are employed to save space by eliminating redundancy.

This separation ensures that each query touches only one, or at most a few, disk blocks, with each transfer page-aligned and therefore efficient.
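The two-tier layout can be sketched in a few lines of Python. This is a toy illustration only: `STRIDE`, `L`, and the miniature corpus are illustrative choices, not parameters of the published systems.

```python
STRIDE = 4   # B': suffix-array entries per on-disk block (tiny, for illustration)
L = 3        # length of the prefix (L-gram) kept in RAM for each block

def build_staged_sa(text):
    """Toy two-tier index: the full suffix array is chopped into
    fixed-stride 'disk' blocks; RAM keeps only one short prefix per block."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    blocks = [sa[i:i + STRIDE] for i in range(0, len(sa), STRIDE)]   # on-disk tier
    sampling = [text[b[0]:b[0] + L] for b in blocks]                 # in-memory tier
    return sampling, blocks

sampling, blocks = build_staged_sa("banana$")
```

For this 7-character corpus the result is two blocks and a two-entry sampling vector; at scale, the in-memory tier is smaller than the full suffix array by a factor of the stride.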

2. Query Protocol and Algorithms

Search proceeds in two principal stages:

Stage 1: In-Memory Bracketing

  • The query pattern is mapped to a small interval in the suffix array, as delineated by the sampling array or by BWT-based backward search.
  • In SoftMatcha 2, a binary search over a RAM-resident sampling vector finds a bracket of size ≈B’ in O(log(|C|/B’)) comparisons, where |C| is the corpus size (Yoneda et al., 11 Feb 2026).
  • In RoSA, the condensed BWT supports backward search, yielding the deepest matching prefix among disk blocks, typically in O(m log σ) time for pattern length m (Gog et al., 2013).

Stage 2: Disk Block Inspection

  • A single on-disk block (≤B’ or ≤b entries) spanning the bracketed interval is read by a sequential pread.
  • Binary or linear scan within the block confirms the presence and location of the pattern.
  • SoftMatcha 2 guarantees that the block contains all possible matches of patterns up to length L, with exactly one I/O (Yoneda et al., 11 Feb 2026).
  • RoSA can resolve frequent patterns entirely in-memory; for rare patterns, one sequential block fetch suffices (Gog et al., 2013).

This structure provides logarithmic time within RAM and a constant number of disk I/Os per query.
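A minimal Python sketch of the two-stage protocol, assuming a uniform-stride toy index built in-process (the in-memory list access in Stage 2 stands in for the single sequential disk read; this is not the published implementation):

```python
from bisect import bisect_right

def build(text, stride=4, l=3):
    """Toy staged index: full suffix array in 'disk' blocks,
    one l-character sampling key per block in RAM."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    blocks = [sa[i:i + stride] for i in range(0, len(sa), stride)]
    sampling = [text[b[0]:b[0] + l] for b in blocks]
    return sampling, blocks

def lookup(text, sampling, blocks, pattern):
    # Stage 1 (RAM): binary search the sampling vector to bracket the match.
    i = max(bisect_right(sampling, pattern) - 1, 0)
    # Stage 2 (one simulated disk read): scan the selected block.
    # A production version may also need to inspect an adjacent block
    # when a pattern's occurrences straddle a block boundary.
    return [p for p in blocks[i] if text[p:p + len(pattern)] == pattern]

text = "banana$"
sampling, blocks = build(text)
```

For example, `lookup(text, sampling, blocks, "ana")` touches exactly one block, mirroring the one-I/O guarantee for patterns no longer than the sampled prefix length.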

3. Block Formation and Storage Optimizations

Block layout directly influences disk footprint and access efficiency.

Uniform Sampling vs Prefix-Based Blocking

Design Aspect        SoftMatcha 2 (Yoneda et al., 11 Feb 2026)   RoSA (Gog et al., 2013)
Blocking strategy    Uniform sampling (stride B’)                Prefix-based, variable length
RAM index content    L-gram codes only (sampling array)          Condensed BWT + bitvectors
On-disk compression  Run-length encoding of pointers             Block reductions (BWT-single)

  • Uniform sampling partitions the suffix array at regular strides, facilitating straightforward block location but incurring redundancy if patterns are non-uniformly distributed.
  • Prefix-based blocking exploits common string prefixes, with each block representing maximal groups of suffixes sharing a prefix and not exceeding a fixed size b. Many blocks are further reducible due to shared BWT context, enabling “redirects” that shrink disk storage by up to 50% compared with uniform sampling.
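The prefix-based strategy can be illustrated with a toy recursive partitioner. This is a sketch of the idea only, assuming a sentinel-terminated string; RoSA's actual construction (with BWT-based reductions) is more involved.

```python
def prefix_blocks(text, b=2):
    """Toy prefix-based partitioning: recursively refine suffix-array
    intervals one prefix character at a time until every block holds
    at most b suffixes.  Assumes text ends with a unique sentinel."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def split(lo, hi, depth):
        if hi - lo <= b:
            return [(sa[lo:hi], depth)]   # (block, prefix depth defining it)
        out, i = [], lo
        while i < hi:
            c = text[sa[i] + depth:sa[i] + depth + 1]
            j = i
            while j < hi and text[sa[j] + depth:sa[j] + depth + 1] == c:
                j += 1                    # extend run of equal next-characters
            out += split(i, j, depth + 1)
            i = j
        return out

    return split(0, len(sa), 0)
```

Unlike uniform sampling, the resulting blocks vary in size, so dense prefix regions are split finely while sparse ones stay coarse.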

Block Reductions in RoSA

  • If the suffixes of a block all share the same preceding BWT symbol, the block’s pointers can be redirected into a parent’s subinterval. Blocks of size 1 (“singletons”) are stored directly in memory (Gog et al., 2013).

4. Complexity Analysis

Time Complexity

SoftMatcha 2:

T_lookup(|C|, L, B) = D + M (log(|C|/B) + log B) ≈ D + M log|C|

with D = disk random-access latency, M = memory-access time, |C| = corpus size in tokens, B = block stride (Yoneda et al., 11 Feb 2026).
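Plugging illustrative hardware constants into the formula (the values below are assumptions for scale, not measurements from the paper) shows that the single disk access dominates the total:

```python
import math

# Assumed constants, for illustration only.
D = 100e-6   # disk random-access latency: ~100 microseconds (SSD)
M = 100e-9   # memory-access time: ~100 nanoseconds
C = 1e12     # |C|: corpus size in tokens
B = 512      # block stride

# T_lookup = D + M * (log2(|C|/B) + log2(B)) = D + M * log2(|C|)
t = D + M * (math.log2(C / B) + math.log2(B))
```

With these numbers the memory term contributes only a few microseconds on top of the ~100 µs disk access, which is why bounding the query to one I/O matters more than shaving RAM comparisons.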

RoSA:

  • In-memory prefix mapping by condensed BWT: O(m log σ).
  • At most two disk blocks read per query; one suffices for irreducible cases (Gog et al., 2013).

Space Complexity

SoftMatcha 2:

  • RAM: O((|C|/B) · L log|V|) (sampling array).
  • Disk: O(|C| · ⌈L log|V| / 8⌉) bytes; post-compression, ≈3×–7× reduction (Yoneda et al., 11 Feb 2026).
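The RAM-versus-disk ratio can be made concrete with hypothetical parameter values (all four constants below are assumptions chosen for scale, not figures from the paper):

```python
import math

# Hypothetical parameters, for illustration only.
C = 1e12       # |C|: tokens in the corpus
B = 512        # sampling stride
L = 8          # sampled L-gram length
V = 2 ** 17    # |V|: vocabulary size

entry_bits = L * math.log2(V)                      # bits per L-gram entry
ram_bytes = (C / B) * math.ceil(entry_bits / 8)    # O((|C|/B) * L log|V|)
disk_bytes = C * math.ceil(entry_bits / 8)         # uncompressed on-disk bound
```

Because both tiers store the same per-entry payload, the RAM footprint is exactly the on-disk bound divided by the stride B, i.e. well under 1% of the uncompressed disk size for these assumed values.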

RoSA:

  • In-memory: O((B + σ) log n + z log σ), where B = number of blocks, z = condensed BWT length, n = text size, σ = alphabet size.
  • Disk: Σ_v |block_v| · log₂ n bits for the pointers, plus compressed LCP values and block metadata (Gog et al., 2013).
  • Only 1–3% of the text size is required in RAM in practice; disk consumption is ≈2× the text size.

Disk I/O

  • One I/O per rare-pattern query in both structures; for “heavy” patterns (frequency above block size), all results can be resolved in-memory (Gog et al., 2013).

5. Integration with Soft Pattern Pruning

Disk-aware staged suffix arrays are amenable to semantic (“soft”) query extensions, as in SoftMatcha 2:

  • The RAM-disk division is embedded in iterative soft-pattern enumeration, where candidates are vetted first by in-memory similarity, then confirmed via exact lookup.
  • Additional k-gram cache layers (RAM-resident) enable quick rejection of frequent or implausible small patterns, bypassing disk entirely when possible.
  • “Last-bits pruning” enables efficient in-memory extension checks for low-frequency patterns by bulk-fetching occurrence pointers on a block scan (Yoneda et al., 11 Feb 2026).
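The k-gram quick-rejection layer can be sketched as a RAM-resident counter (a toy stand-in; the function names and k = 2 are illustrative, not the paper's API):

```python
from collections import Counter

def build_kgram_cache(corpus_tokens, k=2):
    """RAM-resident k-gram counts used to reject candidate patterns
    before any disk access."""
    return Counter(tuple(corpus_tokens[i:i + k])
                   for i in range(len(corpus_tokens) - k + 1))

def maybe_present(cache, pattern, k=2):
    """A pattern can occur only if every one of its k-grams occurs;
    any zero count rejects the candidate with no I/O at all."""
    return all(cache[tuple(pattern[i:i + k])] > 0
               for i in range(len(pattern) - k + 1))

tokens = "the cat sat on the mat".split()
cache = build_kgram_cache(tokens)
```

A soft-match enumerator would call `maybe_present` on each relaxed candidate and reserve the staged-suffix-array lookup (and its single disk read) for the survivors.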

Pseudocode for iterative soft search algorithms shows that disk access remains tightly bounded—one I/O per “hard hit”—regardless of the branching in the semantic relaxation protocol.

6. Empirical Evaluation and Comparative Results

Key empirical findings across published results:

Corpus Size    Infini-gram p95   Staged SA (SoftMatcha 2) p95
100 B tokens   1.05 ms           0.03 ms
273 B tokens   3.45 ms           0.32 ms
1.4 T tokens   11.05 ms          0.34 ms
  • Disk-aware staged SAs achieve a 10×–33× speedup over Infini-gram and similar baselines (single SSD read per exact-match query).
  • On a 64 GB web text (RoSA), with block size 4 KiB and 2.5% RAM use, queries are answered in 0.3–1.1 ms (SSD) or 2–4 ms (HDD) (Gog et al., 2013).
  • “FM-indexes” can match these speeds only if the full index fits in memory, which is unattainable for trillion-token corpora.
  • For soft-search queries (e.g., K = 20, α = 0.45), end-to-end p95 latency remains <300 ms at 1.4 T tokens in SoftMatcha 2, versus >4 s for FM-index-based approaches (Yoneda et al., 11 Feb 2026).

7. Context and Significance

Disk-aware staged suffix arrays underpin practical substring search and pattern-matching at scales beyond the reach of flat in-memory indexes. By mapping the search problem to a two-level index, these structures harness the speed of RAM for coarse search and the sequential-read efficiency of disk for fine-grained confirmation. Prefix-based block reductions (RoSA) further minimize disk space and improve locality, while run-length encoding and page-aligned reads (SoftMatcha 2) align with OS and hardware optimizations.

These advances have enabled:

  • Flexible pattern matching with semantic/ranked outputs (substitutions, insertions, deletions).
  • Large-scale corpus contamination detection in AI training data (Yoneda et al., 11 Feb 2026).
  • Efficient search over multilingual or domain-specific corpora.

A plausible implication is continued expansion of this approach in other large-scale information retrieval and NLP contexts, particularly where index size and latency dominate practical system design.
