Disk-Aware Staged Suffix Arrays
- Disk-Aware Staged Suffix Arrays are two-tier data structures that split search between a compact in-memory index and a compressed on-disk store to minimize random-access I/O.
- They use a sampling vector or condensed BWT for rapid in-memory bracketing, followed by efficient sequential disk block inspection to confirm pattern matches.
- Empirical evaluations show up to 33× speedup over traditional methods, making them essential for scalable, soft query matching on terabyte-scale corpora.
A disk-aware staged suffix array is a two-level (RAM and disk) data structure for scalable pattern search on massive corpora. It refines the classical suffix array by staging its components to minimize random-access I/O on disk, enabling millisecond search latency over terabyte- to trillion-token text collections with RAM footprints far below the corpus size. This approach is central to algorithms such as SoftMatcha 2 and RoSA, which leverage it to enable flexible matching—including semantic or “soft” queries—at scales where traditional FM-index or uniform-sampled suffix arrays are impractical (Yoneda et al., 11 Feb 2026, Gog et al., 2013).
1. Structural Overview
Disk-aware staged suffix arrays realize a separation between:
- A compact in-memory index that rapidly restricts the locus of possible matches.
- A compressed on-disk store of the bulk of suffix-array data, organized for efficient block-oriented access.
Key Components
In-Memory Component
- Stores a sparse “sampling” array (SoftMatcha 2) or a condensed Burrows-Wheeler Transform (BWT) index (RoSA).
- Maps patterns to block-level regions of the suffix array in time logarithmic in the number of in-memory entries (sampling array) or linear in the pattern length (BWT backward search).
- Designed to fit within a small fraction of the corpus size, often 1–3% in practice.
On-Disk Component
- Holds the main lexicographically sorted array of pointers to corpus positions (suffix array) in compressed blocks.
- Block boundaries are determined either by uniform sampling (every B’ entries) (Yoneda et al., 11 Feb 2026) or by variable-length prefix-based partitioning (RoSA) (Gog et al., 2013).
- Run-length encoding and block reductions are employed to save space by eliminating redundancy.
This separation ensures that only one disk block, or at most a few, is accessed per query, with each transfer page-aligned and therefore efficient.
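The split can be made concrete with a minimal build-step sketch. This is an illustration, not the published implementation: it assumes a uniform stride in the SoftMatcha 2 style, keeps one L-gram sample per block in RAM, and packs each pointer block as 32-bit little-endian integers for the on-disk store.

```python
import struct

def build_staged_index(text: str, block_stride: int, lgram: int):
    """Return (ram_samples, disk_blocks) for a uniform-stride staged suffix array."""
    # Full suffix array (naive comparison sort here; real systems use
    # external-memory suffix-array construction for terabyte corpora).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    ram_samples = []   # L-gram prefix of every B'-th suffix: the in-memory index
    disk_blocks = []   # packed pointer blocks: the on-disk component
    for start in range(0, len(sa), block_stride):
        block = sa[start:start + block_stride]
        ram_samples.append(text[block[0]:block[0] + lgram])
        disk_blocks.append(struct.pack(f"<{len(block)}I", *block))
    return ram_samples, disk_blocks
```

The RAM footprint is one short sample per block rather than one entry per suffix, which is what keeps the in-memory component to a small fraction of the corpus size.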
2. Query Protocol and Algorithms
Search proceeds in two principal stages:
Stage 1: In-Memory Bracketing
- The query pattern is mapped to a small interval in the suffix array, as delineated by the sampling array or by BWT-based backward search.
- In SoftMatcha 2, a binary search over a RAM-based sampling vector finds the bracket of size ≈B’ in O(log(n/B’)) comparisons, where n is the corpus size (Yoneda et al., 11 Feb 2026).
- In RoSA, the condensed BWT supports backward search yielding the deepest matching prefix among disk blocks, typically in O(m) time for a pattern of length m (Gog et al., 2013).
Stage 2: Disk Block Inspection
- A single on-disk block (≤B’ or ≤b entries) spanning the bracketed interval is read by a sequential pread.
- Binary or linear scan within the block confirms the presence and location of the pattern.
- SoftMatcha 2 guarantees that the block contains all possible matches of patterns up to length L (the sampled L-gram length), with exactly one I/O (Yoneda et al., 11 Feb 2026).
- RoSA can resolve frequent patterns entirely in-memory; for rare patterns, one sequential block fetch suffices (Gog et al., 2013).
This structure provides logarithmic time within RAM and a constant number of disk I/Os per query.
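The two stages can be sketched end to end. The function below is illustrative, not the authors' API: it assumes a uniform-stride layout with the suffix array stored on disk as packed 32-bit pointers and one sampled suffix prefix per block held in RAM. Stage 1 brackets the pattern entirely in memory; Stage 2 issues one sequential read over the bracketed blocks and scans it.

```python
import bisect, struct

def staged_search(pattern, text, ram_samples, disk_file, block_stride):
    """Return all corpus positions of `pattern` with one sequential disk read."""
    # Stage 1: RAM-only bracketing over the sampled block-boundary prefixes.
    s = max(bisect.bisect_left(ram_samples, pattern) - 1, 0)
    e = bisect.bisect_right(ram_samples, pattern + "\uffff")
    # Stage 2: one sequential, page-aligned read spanning the bracketed blocks
    # (for rare patterns this bracket is a single block, hence a single I/O).
    disk_file.seek(s * block_stride * 4)
    raw = disk_file.read((e - s) * block_stride * 4)
    ptrs = struct.unpack(f"<{len(raw) // 4}I", raw)
    # Linear scan inside the bracket confirms exact occurrences.
    return sorted(p for p in ptrs if text[p:p + len(pattern)] == pattern)
```

The sentinel `"\uffff"` is an illustrative upper bound for the bracket's right edge; a production index would instead derive both edges from the sampled L-gram codes.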
3. Block Formation and Storage Optimizations
Block layout directly influences disk footprint and access efficiency.
Uniform Sampling vs Prefix-Based Blocking
| Design Aspect | SoftMatcha 2 (Yoneda et al., 11 Feb 2026) | RoSA (Gog et al., 2013) |
|---|---|---|
| Blocking strategy | Uniform sampling (stride B’) | Prefix-based, variable length |
| RAM index content | L-gram codes only (sampling array) | Condensed BWT + bitvectors |
| On-disk compression | Run-length encoding of pointers | Block reductions (BWT-single) |
- Uniform sampling partitions the suffix array at regular strides, facilitating straightforward block location but incurring redundancy if patterns are non-uniformly distributed.
- Prefix-based blocking exploits common string prefixes, with each block representing a maximal group of suffixes sharing a prefix and not exceeding a fixed size b. Many blocks are further reducible due to shared BWT context, enabling “redirects” that shrink disk storage by up to 50% compared with uniform sampling.
Block Reductions in RoSA
- If the suffixes of a block all share the same preceding BWT symbol, the block’s pointers can be redirected into a parent’s subinterval. Blocks of size 1 (“singletons”) are stored directly in memory (Gog et al., 2013).
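The reducibility test can be sketched as follows. This is a simplified illustration of the idea, not the RoSA data structure itself (which operates on BWT intervals); the helper name is invented, and the wrap-around for position 0 is an assumption matching the usual BWT convention.

```python
def reducible(block_ptrs, text):
    """Return the shared preceding BWT symbol, or None if the block is irreducible."""
    # Collect the character preceding each suffix in the block
    # (position 0 wraps to the final sentinel, as in the standard BWT).
    preceding = {text[p - 1] if p > 0 else text[-1] for p in block_ptrs}
    # A single shared symbol means the block's pointers are recoverable from the
    # parent block for the shorter prefix, so only a redirect need be stored.
    return preceding.pop() if len(preceding) == 1 else None
```

Singleton blocks need no such test: as noted above, they are stored directly in memory.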
4. Complexity Analysis
Time Complexity
SoftMatcha 2:
Query time is O(t_mem · log(n/B’)) + t_disk, where t_disk is the disk random-access latency, t_mem the memory-access time, n the number of tokens, and B’ the block stride (Yoneda et al., 11 Feb 2026).
RoSA:
- In-memory prefix mapping by condensed BWT backward search: O(m) for a pattern of length m.
- At most two disk blocks read per query; one suffices for irreducible cases (Gog et al., 2013).
Space Complexity
SoftMatcha 2:
- RAM: O(n/B’) sampled entries (sampling array).
- Disk: O(n) suffix-array pointers; run-length encoding yields a substantial further reduction (Yoneda et al., 11 Feb 2026).
RoSA:
- In-memory: space proportional to the number of blocks plus the length of the condensed BWT, with lower-order terms in the text size and alphabet size.
- Disk: n⌈log n⌉ bits for the suffix-array pointers, plus compressed LCPs and block metadata (Gog et al., 2013).
- In practice only 1–3% of the text size is required in RAM; disk consumption is roughly 2× the text size.
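A back-of-envelope estimate makes these footprints concrete. The stride and per-entry widths below are assumptions chosen for illustration, not figures from the papers.

```python
# Sizing sketch for a uniform-stride staged suffix array (all parameters assumed).
n = 1_400_000_000_000        # 1.4 T tokens
stride = 4096                # B': one RAM sample per on-disk block (assumed)
entry_bytes = 16             # sampled L-gram code + block offset per RAM entry (assumed)

ram_bytes = (n // stride) * entry_bytes    # in-memory sampling array
disk_bytes = n * 5                         # ~5-byte pointers before compression (assumed)

print(ram_bytes / 2**30, disk_bytes / 2**40)  # a few GiB of RAM vs several TiB of disk
```

Under these assumptions the RAM component is roughly three orders of magnitude smaller than the on-disk pointer store, which is what makes trillion-token corpora tractable on a single machine.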
Disk I/O
- One I/O per rare-pattern query in both structures; for “heavy” patterns (frequency above block size), all results can be resolved in-memory (Gog et al., 2013).
5. Integration with Soft Pattern Pruning
Disk-aware staged suffix arrays are amenable to semantic (“soft”) query extensions, as in SoftMatcha 2:
- The RAM-disk division is embedded in iterative soft-pattern enumeration, where candidates are vetted first by in-memory similarity, then confirmed via exact lookup.
- Additional k-gram cache layers (RAM-resident) enable quick rejection of frequent or implausible small patterns, bypassing disk entirely when possible.
- “Last-bits pruning” enables efficient in-memory extension checks for low-frequency patterns by bulk-fetching occurrence pointers on a block scan (Yoneda et al., 11 Feb 2026).
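The k-gram cache check reduces to a cheap necessary condition, sketched below with an invented helper name and a plain Python set standing in for the RAM-resident cache: if any k-token window of a candidate soft pattern is absent from the corpus's k-gram set, the candidate cannot occur anywhere, so the disk lookup is skipped.

```python
def maybe_present(candidate, kgram_cache, k):
    """Necessary condition for a match: every k-gram of the candidate is cached."""
    # A single missing k-gram proves zero occurrences, with no I/O at all.
    return all(tuple(candidate[i:i + k]) in kgram_cache
               for i in range(len(candidate) - k + 1))
```

A `False` here is definitive (the candidate is rejected); a `True` only means the exact staged lookup must still run.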
Pseudocode for iterative soft search algorithms shows that disk access remains tightly bounded—one I/O per “hard hit”—regardless of the branching in the semantic relaxation protocol.
6. Empirical Evaluation and Comparative Results
Key empirical findings across published results:
| Corpus Size | Infini-gram p95 | Staged SA (SoftMatcha 2) p95 |
|---|---|---|
| 100 B tokens | 1.05 ms | 0.03 ms |
| 273 B tokens | 3.45 ms | 0.32 ms |
| 1.4 T tokens | 11.05 ms | 0.34 ms |
- Disk-aware staged SAs achieve a 10×–33× speedup over Infini-gram and similar baselines (single SSD read per exact-match query).
- On a 64 GB web text (RoSA), with block size 4 KiB and 2.5% RAM use, queries are answered in 0.3–1.1 ms (SSD) or 2–4 ms (HDD) (Gog et al., 2013).
- “FM-indexes” can match these speeds only if the full index fits in memory, which is unattainable for trillion-token corpora.
- For soft-search queries, end-to-end p95 latency remains <300 ms at 1.4 T tokens in SoftMatcha 2, versus >4 s for FM-index-based approaches (Yoneda et al., 11 Feb 2026).
7. Context and Significance
Disk-aware staged suffix arrays underpin practical substring search and pattern-matching at scales beyond the reach of flat in-memory indexes. By mapping the search problem to a two-level index, these structures harness the speed of RAM for coarse search and the sequential-read efficiency of disk for fine-grained confirmation. Prefix-based block reductions (RoSA) further minimize disk space and improve locality, while run-length encoding and page-aligned reads (SoftMatcha 2) align with OS and hardware optimizations.
These advances have enabled:
- Flexible pattern matching with semantic/ranked outputs (substitutions, insertions, deletions).
- Large-scale corpus contamination detection in AI training data (Yoneda et al., 11 Feb 2026).
- Efficient search over multilingual or domain-specific corpora.
A plausible implication is continued expansion of this approach in other large-scale information retrieval and NLP contexts, particularly where index size and latency dominate practical system design.