Disk-Aware Staged Suffix Arrays
- Disk-Aware Staged Suffix Arrays are two-tier data structures that split search between a compact in-memory index and a compressed on-disk store to minimize random-access I/O.
- They use a sampling vector or condensed BWT for rapid in-memory bracketing, followed by efficient sequential disk block inspection to confirm pattern matches.
- Empirical evaluations show up to 33× speedup over traditional methods, making them essential for scalable, soft query matching on terabyte-scale corpora.
A disk-aware staged suffix array is a two-level (RAM and disk) data structure for scalable pattern search on massive corpora. It refines the classical suffix array by staging its components to minimize random-access I/O on disk, enabling millisecond search latency over terabyte- to trillion-token text collections with RAM footprints far below the corpus size. This approach is central to algorithms such as SoftMatcha 2 and RoSA, which leverage it to enable flexible matching—including semantic or “soft” queries—at scales where traditional FM-index or uniform-sampled suffix arrays are impractical (Yoneda et al., 11 Feb 2026, Gog et al., 2013).
1. Structural Overview
Disk-aware staged suffix arrays realize a separation between:
- A compact in-memory index that rapidly restricts the locus of possible matches.
- A compressed on-disk store of the bulk of suffix-array data, organized for efficient block-oriented access.
Key Components
In-Memory Component
- Stores a sparse “sampling” array (SoftMatcha 2) or a condensed Burrows-Wheeler Transform (BWT) index (RoSA).
- Maps patterns to block-level regions of the suffix array in time logarithmic in the number of in-memory entries (sampling array) or linear in the pattern length (BWT backward search).
- Designed to fit within a small fraction of the corpus size, often 1–3% in practice.
On-Disk Component
- Holds the main lexicographically sorted array of pointers to corpus positions (suffix array) in compressed blocks.
- Block boundaries are determined either by uniform sampling (every B’ entries) (Yoneda et al., 11 Feb 2026) or by variable-length prefix-based partitioning (RoSA) (Gog et al., 2013).
- Run-length encoding and block reductions are employed to save space by eliminating redundancy.
This separation ensures that only one disk block, or at most a few, is accessed per query, with each transfer page-aligned and therefore efficient.
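The split can be made concrete with a minimal build-step sketch. This is an illustration, not the published implementation: it assumes a uniform stride in the SoftMatcha 2 style, keeps one L-gram sample per block in RAM, and packs each pointer block as 32-bit little-endian integers for the on-disk store.

```python
import struct

def build_staged_index(text: str, block_stride: int, lgram: int):
    """Return (ram_samples, disk_blocks) for a uniform-stride staged suffix array."""
    # Full suffix array (naive comparison sort here; real systems use
    # external-memory suffix-array construction for terabyte corpora).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    ram_samples = []   # L-gram prefix of every B'-th suffix: the in-memory index
    disk_blocks = []   # packed pointer blocks: the on-disk component
    for start in range(0, len(sa), block_stride):
        block = sa[start:start + block_stride]
        ram_samples.append(text[block[0]:block[0] + lgram])
        disk_blocks.append(struct.pack(f"<{len(block)}I", *block))
    return ram_samples, disk_blocks
```

The RAM footprint is one short sample per block rather than one entry per suffix, which is what keeps the in-memory component to a small fraction of the corpus size.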
2. Query Protocol and Algorithms
Search proceeds in two principal stages:
Stage 1: In-Memory Bracketing
- The query pattern is mapped to a small interval in the suffix array, as delineated by the sampling array or by BWT-based backward search.
- In SoftMatcha 2, a binary search over a RAM-based sampling vector finds the bracket of size ≈B’ in O(log(n/B’)) comparisons, where n is the corpus size (Yoneda et al., 11 Feb 2026).
- In RoSA, the condensed BWT supports backward search yielding the deepest matching prefix among disk blocks, typically in O(m) time for a pattern of length m (Gog et al., 2013).
Stage 2: Disk Block Inspection
- A single on-disk block (≤B’ or ≤b entries) spanning the bracketed interval is read by a sequential pread.
- Binary or linear scan within the block confirms the presence and location of the pattern.
- SoftMatcha 2 guarantees that the block contains all possible matches of patterns up to length L (the sampled L-gram length), with exactly one I/O (Yoneda et al., 11 Feb 2026).
- RoSA can resolve frequent patterns entirely in-memory; for rare patterns, one sequential block fetch suffices (Gog et al., 2013).
This structure provides logarithmic time within RAM and a constant number of disk I/Os per query.
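The two stages can be sketched end to end. The function below is illustrative, not the authors' API: it assumes a uniform-stride layout with the suffix array stored on disk as packed 32-bit pointers and one sampled suffix prefix per block held in RAM. Stage 1 brackets the pattern entirely in memory; Stage 2 issues one sequential read over the bracketed blocks and scans it.

```python
import bisect, struct

def staged_search(pattern, text, ram_samples, disk_file, block_stride):
    """Return all corpus positions of `pattern` with one sequential disk read."""
    # Stage 1: RAM-only bracketing over the sampled block-boundary prefixes.
    s = max(bisect.bisect_left(ram_samples, pattern) - 1, 0)
    e = bisect.bisect_right(ram_samples, pattern + "\uffff")
    # Stage 2: one sequential, page-aligned read spanning the bracketed blocks
    # (for rare patterns this bracket is a single block, hence a single I/O).
    disk_file.seek(s * block_stride * 4)
    raw = disk_file.read((e - s) * block_stride * 4)
    ptrs = struct.unpack(f"<{len(raw) // 4}I", raw)
    # Linear scan inside the bracket confirms exact occurrences.
    return sorted(p for p in ptrs if text[p:p + len(pattern)] == pattern)
```

The sentinel `"\uffff"` is an illustrative upper bound for the bracket's right edge; a production index would instead derive both edges from the sampled L-gram codes.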
3. Block Formation and Storage Optimizations
Block layout directly influences disk footprint and access efficiency.
Uniform Sampling vs Prefix-Based Blocking
| Design Aspect | SoftMatcha 2 (Yoneda et al., 11 Feb 2026) | RoSA (Gog et al., 2013) |
|---|---|---|
| Blocking strategy | Uniform sampling (stride B’) | Prefix-based, variable length |
| RAM index content | L-gram codes only (sampling array) | Condensed BWT + bitvectors |
| On-disk compression | Run-length encoding of pointers | Block reductions (BWT-single) |
- Uniform sampling partitions the suffix array at regular strides, facilitating straightforward block location but incurring redundancy if patterns are non-uniformly distributed.
- Prefix-based blocking exploits common string prefixes, with each block representing a maximal group of suffixes sharing a prefix and not exceeding a fixed size b. Many blocks are further reducible due to shared BWT context, enabling “redirects” that shrink disk storage by up to 50% compared with uniform sampling.
Block Reductions in RoSA
- If the suffixes of a block all share the same preceding BWT symbol, the block’s pointers can be redirected into a parent’s subinterval. Blocks of size 1 (“singletons”) are stored directly in memory (Gog et al., 2013).
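The reducibility test can be sketched as follows. This is a simplified illustration of the idea, not the RoSA data structure itself (which operates on BWT intervals); the helper name is invented, and the wrap-around for position 0 is an assumption matching the usual BWT convention.

```python
def reducible(block_ptrs, text):
    """Return the shared preceding BWT symbol, or None if the block is irreducible."""
    # Collect the character preceding each suffix in the block
    # (position 0 wraps to the final sentinel, as in the standard BWT).
    preceding = {text[p - 1] if p > 0 else text[-1] for p in block_ptrs}
    # A single shared symbol means the block's pointers are recoverable from the
    # parent block for the shorter prefix, so only a redirect need be stored.
    return preceding.pop() if len(preceding) == 1 else None
```

Singleton blocks need no such test: as noted above, they are stored directly in memory.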
4. Complexity Analysis
Time Complexity
SoftMatcha 2:
Query time is O(t_mem · log(n/B’)) + t_disk, where t_disk is the disk random-access latency, t_mem the memory-access time, n the number of tokens, and B’ the block stride (Yoneda et al., 11 Feb 2026).
RoSA:
- In-memory prefix mapping by condensed BWT backward search: O(m) for a pattern of length m.
- At most two disk blocks read per query; one suffices for irreducible cases (Gog et al., 2013).
Space Complexity
SoftMatcha 2:
- RAM: O(n/B’) sampled entries (sampling array).
- Disk: O(n) suffix-array pointers; run-length encoding yields a substantial further reduction (Yoneda et al., 11 Feb 2026).
RoSA:
- In-memory: space proportional to the number of blocks plus the length of the condensed BWT, with lower-order terms in the text size and alphabet size.
- Disk: n⌈log n⌉ bits for the suffix-array pointers, plus compressed LCPs and block metadata (Gog et al., 2013).
- In practice only 1–3% of the text size is required in RAM; disk consumption is roughly 2× the text size.
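A back-of-envelope estimate makes these footprints concrete. The stride and per-entry widths below are assumptions chosen for illustration, not figures from the papers.

```python
# Sizing sketch for a uniform-stride staged suffix array (all parameters assumed).
n = 1_400_000_000_000        # 1.4 T tokens
stride = 4096                # B': one RAM sample per on-disk block (assumed)
entry_bytes = 16             # sampled L-gram code + block offset per RAM entry (assumed)

ram_bytes = (n // stride) * entry_bytes    # in-memory sampling array
disk_bytes = n * 5                         # ~5-byte pointers before compression (assumed)

print(ram_bytes / 2**30, disk_bytes / 2**40)  # a few GiB of RAM vs several TiB of disk
```

Under these assumptions the RAM component is roughly three orders of magnitude smaller than the on-disk pointer store, which is what makes trillion-token corpora tractable on a single machine.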
Disk I/O
- One I/O per rare-pattern query in both structures; for “heavy” patterns (frequency above block size), all results can be resolved in-memory (Gog et al., 2013).
5. Integration with Soft Pattern Pruning
Disk-aware staged suffix arrays are amenable to semantic (“soft”) query extensions, as in SoftMatcha 2:
- The RAM-disk division is embedded in iterative soft-pattern enumeration, where candidates are vetted first by in-memory similarity, then confirmed via exact lookup.
- Additional k-gram cache layers (RAM-resident) enable quick rejection of frequent or implausible small patterns, bypassing disk entirely when possible.
- “Last-bits pruning” enables efficient in-memory extension checks for low-frequency patterns by bulk-fetching occurrence pointers on a block scan (Yoneda et al., 11 Feb 2026).
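The k-gram cache check reduces to a cheap necessary condition, sketched below with an invented helper name and a plain Python set standing in for the RAM-resident cache: if any k-token window of a candidate soft pattern is absent from the corpus's k-gram set, the candidate cannot occur anywhere, so the disk lookup is skipped.

```python
def maybe_present(candidate, kgram_cache, k):
    """Necessary condition for a match: every k-gram of the candidate is cached."""
    # A single missing k-gram proves zero occurrences, with no I/O at all.
    return all(tuple(candidate[i:i + k]) in kgram_cache
               for i in range(len(candidate) - k + 1))
```

A `False` here is definitive (the candidate is rejected); a `True` only means the exact staged lookup must still run.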
Pseudocode for iterative soft search algorithms shows that disk access remains tightly bounded—one I/O per “hard hit”—regardless of the branching in the semantic relaxation protocol.
6. Empirical Evaluation and Comparative Results
Key empirical findings across published results:
| Corpus Size | Infini-gram p95 | Staged SA (SoftMatcha 2) p95 |
|---|---|---|
| 100 B tokens | 1.05 ms | 0.03 ms |
| 273 B tokens | 3.45 ms | 0.32 ms |
| 1.4 T tokens | 11.05 ms | 0.34 ms |
- Disk-aware staged SAs achieve a 10×–33× speedup over Infini-gram and similar baselines (single SSD read per exact-match query).
- On a 64 GB web text (RoSA), with block size 4 KiB and 2.5% RAM use, queries are answered in 0.3–1.1 ms (SSD) or 2–4 ms (HDD) (Gog et al., 2013).
- “FM-indexes” can match these speeds only if the full index fits in memory, which is unattainable for trillion-token corpora.
- For soft-search queries, end-to-end p95 latency remains <300 ms at 1.4 T tokens in SoftMatcha 2, versus >4 s for FM-index-based approaches (Yoneda et al., 11 Feb 2026).
7. Context and Significance
Disk-aware staged suffix arrays underpin practical substring search and pattern-matching at scales beyond the reach of flat in-memory indexes. By mapping the search problem to a two-level index, these structures harness the speed of RAM for coarse search and the sequential-read efficiency of disk for fine-grained confirmation. Prefix-based block reductions (RoSA) further minimize disk space and improve locality, while run-length encoding and page-aligned reads (SoftMatcha 2) align with OS and hardware optimizations.
These advances have enabled:
- Flexible pattern matching with semantic/ranked outputs (substitutions, insertions, deletions).
- Large-scale corpus contamination detection in AI training data (Yoneda et al., 11 Feb 2026).
- Efficient search over multilingual or domain-specific corpora.
A plausible implication is continued expansion of this approach in other large-scale information retrieval and NLP contexts, particularly where index size and latency dominate practical system design.