Extended Burrows-Wheeler Transform (eBWT)

Updated 28 November 2025

The extended Burrows–Wheeler Transform (eBWT) is a generalization of the classical BWT that processes collections of strings, graphs, or combinatorial structures for advanced indexing and compression.
Algorithmic strategies such as GSA-based methods, suffix array adaptations, and Lyndon grammar approaches achieve efficient eBWT construction with linear or near-linear time complexity for large datasets.
Practical applications include reference‐free genomic analysis, pattern matching, and compressed indexing of highly repetitive data, while open challenges remain in optimal run decomposition and dynamic updates.

The extended Burrows–Wheeler Transform (eBWT) generalizes the classical Burrows–Wheeler Transform from single-string input to collections or multisets of strings, graphs, or even more general combinatorial structures. This extension preserves and amplifies many of the combinatorial, algebraic, and algorithmic properties that make the original transform central in text indexing, compression, and bioinformatics, especially in settings where data repetitiveness and structural complexity are paramount. The eBWT underlies a wide spectrum of advanced data structures, including those for reference-free genomic analysis, suffix and Wheeler graph-based indexes, and compressed representations of highly repetitive datasets.

1. Mathematical Foundations and Formal Definitions

The eBWT for a multiset of strings $M = \{S^1, S^2, \ldots, S^k\}$ over a totally ordered alphabet $\Sigma$ is constructed by considering all conjugates (cyclic rotations) of the strings in $M$, forming a collection $C$, and sorting $C$ in infinite periodic order $<_\infty$, whereby $U <_\infty V$ iff $U^\infty$ is lexicographically smaller than $V^\infty$. The transform $eBWT(M)$ is the sequence of last characters of the sorted conjugates, that is, the character cyclically preceding each one, yielding a string of length $\sum_{i=1}^{k} |S^i|$ (Olbrich, 27 Apr 2025, Higgins, 2019, Ingels et al., 5 Jun 2025).
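The definition can be turned into a direct, quadratic-time sketch. The function names `omega_compare` and `naive_ebwt` are illustrative and not taken from any cited work; the code assumes non-empty input strings and compares prefixes of length $|U| + |V|$, which is enough to separate two distinct infinite periodic words.

```python
from functools import cmp_to_key


def omega_compare(u, v):
    """Compare u^infinity with v^infinity; returns -1, 0, or 1."""
    # A prefix of length len(u) + len(v) suffices to tell apart two
    # distinct infinite periodic words (Fine and Wilf periodicity bound).
    n = len(u) + len(v)
    pu = (u * (n // len(u) + 1))[:n]
    pv = (v * (n // len(v) + 1))[:n]
    return (pu > pv) - (pu < pv)


def naive_ebwt(strings):
    """Return the eBWT of a multiset of non-empty strings (naive sketch)."""
    # Collect every cyclic rotation of every input string.
    conjugates = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    # Sort the rotations in infinite periodic order.
    conjugates.sort(key=cmp_to_key(omega_compare))
    # The transform is the last character of each sorted rotation.
    return "".join(c[-1] for c in conjugates)


print(naive_ebwt(["abac", "cbab", "bca", "cba"]))  # ccbbbcacaaabba
```

Rotations whose infinite powers coincide may appear in either order, but they always share the same last character, so the output is well defined. This sketch is meant only to make the definition concrete; it is not one of the efficient constructions discussed later.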

For string collections, the framework can also be formulated via the generalized suffix array (GSA) and the multi-string BWT:

$\mathrm{GSA}[i] = (r, j)$ indicates that the $i$-th smallest suffix (over all augmented strings $S_r\$_r$) is the suffix of $S_r\$_r$ starting at position $j$.
$eBWT[i]$ is $S_r[j-1]$ if $\mathrm{GSA}[i] = (r, j)$ and $j > 1$, and the end-marker $\$_r$ if $j = 1$ (Prezza et al., 2018).
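A minimal sketch of the GSA formulation follows, under stated assumptions: positions are 0-based (so the end-marker case is `j == 0`), each string gets its own end-marker, end-markers sort below every letter and are ordered by string index, and all markers are printed as `$`. The function name `gsa_and_ebwt` is hypothetical, and the naive suffix sorting is for illustration only.

```python
def gsa_and_ebwt(strings):
    """Return (GSA, BWT) for a collection, using distinct end-markers."""
    entries = []
    for r, s in enumerate(strings):
        for j in range(len(s) + 1):
            # Letters are tagged 1 and the end-marker of string r is (0, r),
            # so markers sort below all letters and $_r < $_s when r < s.
            key = [(1, ch) for ch in s[j:]] + [(0, r)]
            entries.append((key, r, j))
    entries.sort(key=lambda e: e[0])
    gsa = [(r, j) for _, r, j in entries]
    # Character preceding each suffix; a suffix starting at position 0
    # is preceded (cyclically) by the end-marker of its own string.
    bwt = "".join(strings[r][j - 1] if j > 0 else "$" for r, j in gsa)
    return gsa, bwt


gsa, bwt = gsa_and_ebwt(["ab", "b"])
print(gsa)  # [(0, 2), (1, 1), (0, 0), (0, 1), (1, 0)]
print(bwt)  # bb$a$
```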

The construction extends to more general objects. On Wheeler graphs, the eBWT is the string of edge labels in Wheeler order (Alanko et al., 2018), and for labeled trees the XBWT provides the analogous generalization (Gagie et al., 2018).

2. Algorithmic Construction and Complexity

Multiple algorithmic approaches exist for eBWT construction across different settings:

GSA-based: For a collection of $k$ strings of total length $N$, compute the GSA and the corresponding LCP array in $O(N)$ time and space (Prezza et al., 2018).
Suffix array adaptation: For primitive factors, one can canonicalize each input string to its minimal conjugate, sort the canonical forms, concatenate them, and compute the bijective BWT (BBWT) by constructing the circular suffix array (CSA) under $<_\infty$, which allows linear-time construction via an extension of SAIS (Bannai et al., 2019).
Lyndon grammar approach: For large, repetitive datasets, grammar compression is leveraged by building Lyndon straight-line programs (SLPs) representing the input corpus. Construction involves three phases: grammar construction (Duval’s algorithm streaming right-to-left), lexicographic sorting of nonterminals (via first-symbol forests), and eBWT extraction by enumerating conjugates in infinite periodic order. Time and space requirements drop when the grammar size is much smaller than the input length, and practical implementations can additionally run-length encode the BWT for further compression (Olbrich, 27 Apr 2025).
Dollarless and tree-based: For aligned readsets, eliminating separator symbols lowers the run count. The XBWT model removes explicit separator symbols by exploiting the structure of a labeled tree that encodes reads and their alignments, with sorting by upward-path labels replacing suffix arrays. The overall cost is then governed by the number of edges of the tree (Gagie et al., 2018).
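Two building blocks named in the list above, canonicalization to the minimal conjugate and Lyndon factorization, can be sketched as follows. This is the textbook left-to-right form of Duval's algorithm, not the right-to-left streaming variant attributed to the Lyndon grammar construction, and `minimal_conjugate` is a deliberately naive quadratic helper. Both function names are illustrative.

```python
def minimal_conjugate(s):
    """Return the lexicographically smallest rotation of s (naive)."""
    return min(s[i:] + s[:i] for i in range(len(s)))


def lyndon_factorization(s):
    """Duval's algorithm: split s into a non-increasing list of Lyndon words."""
    n = len(s)
    i = 0
    factors = []
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon words as far as possible.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # the candidate word grows; restart comparison
            else:
                k += 1       # still matching a repetition of the word
            j += 1
        # Emit each full repetition of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(minimal_conjugate("cbab"))       # abcb
print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```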

Parallelization (e.g., thread-parallel Lyndon parsing), hash-consing of grammar rules, and hybrid in-memory/external-memory schemes enable scalability to datasets comprising tens of billions of characters (Olbrich, 27 Apr 2025).

3. Structural and Algebraic Properties

The eBWT is bijective on its combinatorial domain (e.g., primitive necklaces, multisets of strings) and supports invertibility akin to the original BWT. The inverse is computed by reconstructing the standard permutation associated with the output via order-preserving partial maps between the sorted and original output positions, followed by decomposing the permutation into its cycles, each corresponding to a primitive word in the original input (Higgins, 2019). This cycle-based inversion generalizes classical BWT inversion, and its running time is bounded in terms of the output length $n$ and the alphabet size $\sigma$.
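The cycle-based inversion can be sketched as follows, assuming the input is a valid eBWT string. The standard permutation is obtained here with a stable sort, so this sketch runs in $O(n \log n)$ time rather than matching any bound from the cited work; the name `invert_ebwt` is illustrative. Each cycle yields one word, reported as the rotation that starts at the smallest position in the cycle, so inputs are recovered only up to rotation.

```python
def invert_ebwt(last):
    """Recover the words encoded by an eBWT string, one per cycle."""
    n = len(last)
    # Standard permutation: pi[i] is the position in `last` of the i-th
    # character in stably sorted order (Python's sort is stable).
    pi = sorted(range(n), key=lambda i: last[i])
    first = sorted(last)  # first column of the sorted rotation matrix
    seen = [False] * n
    words = []
    for start in range(n):
        if seen[start]:
            continue
        chars = []
        i = start
        # Following pi moves to the rotation shifted left by one, so the
        # first-column characters along the cycle spell the word.
        while not seen[i]:
            seen[i] = True
            chars.append(first[i])
            i = pi[i]
        words.append("".join(chars))
    return words


print(invert_ebwt("ccbbbcacaaabba"))  # ['abac', 'abc', 'abcb', 'acb']
```

For the input multiset {abac, cbab, bca, cba}, whose eBWT is `ccbbbcacaaabba`, the four recovered words are rotations of the four inputs.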

The algebraic perspective connects the induced partial maps to syntactic semigroups of cyclic languages. Notably, for primitive words, the eBWT’s induced semigroup is linked to language-theoretic constructs (Higgins, 2019). Combination with Lyndon factorizations and bijections facilitates combinatorial enumeration of special objects (e.g., de Bruijn words).

For tree- and graph-based models, the eBWT generalizes to Wheeler graphs, where the ordering constraints on vertices and edges yield efficient rank/select-enabled navigation and ensure correctness of FM-index operations after tunneling and other nontrivial transformations (Alanko et al., 2018).

4. Run Count, Decomposition, and RLE Compressibility

Run-length encoding (RLE) size is determined by the number of runs, defined as the number of maximal constant blocks in the eBWT output (Ingels et al., 5 Jun 2025, Gagie et al., 2018). The choice of decomposition into substrings or tree/graph paths is pivotal:

The number of decompositions of a word $w$ into factors of length greater than $\ell$ is exponential in $|w|$ (generalized Fibonacci growth). There exists a universal upper bound for the minimum run count among $\ell$-constrained decompositions, and for certain words, the ratio between maximal and minimal run counts across all decompositions is unbounded (Ingels et al., 5 Jun 2025).
For genomic readsets, omitting end-markers from the eBWT (dollarless variant) reduces the number of runs by nearly 19%, with tree-based XBWT lowering the run count by a further 15% (Gagie et al., 2018). Lower run counts directly improve space usage in run-length FM-indexes and accelerate rank/select queries.
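The two quantities discussed in this section, the run count of a transform and the number of admissible decompositions, are easy to compute for small inputs. The sketch below is illustrative only: the function names are hypothetical, and `count_decompositions` takes the minimum factor length directly as `min_len`, so "length greater than a bound" corresponds to `min_len` equal to that bound plus one.

```python
from itertools import groupby


def run_length_encode(text):
    """Return the maximal constant blocks of text as (char, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def count_runs(text):
    """Number of maximal constant blocks, i.e. the size of the RLE."""
    return len(run_length_encode(text))


def count_decompositions(n, min_len):
    """Count ways to cut a word of length n into factors of length >= min_len."""
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty word has exactly one (empty) decomposition
    for total in range(1, n + 1):
        # Choose the length of the last factor, then decompose the rest.
        ways[total] = sum(ways[total - part]
                          for part in range(min_len, total + 1))
    return ways[n]


print(count_runs("ccbbbcacaaabba"))   # 8
print(count_decompositions(10, 1))    # 512
print(count_decompositions(7, 2))     # 8
```

With `min_len = 1` the count is $2^{n-1}$, and with `min_len = 2` it follows the Fibonacci numbers, which illustrates why exhaustive search over decompositions quickly becomes impractical.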

Tunneling techniques, originally developed for repetitive datasets, can be applied to Wheeler graphs after eBWT construction, further compressing repeated isomorphic blocks beyond what RLE can achieve (Alanko et al., 2018).

5. Applications Across Bioinformatics, Indexing, and Pattern Matching

The eBWT paradigm is exploited across a spectrum of computational biology and text retrieval tasks:

Reference-free/Alignment-free genomics: eBWT supports clustering of reads covering the same genomic locus, with the expected cluster size and boundaries governed by Poisson statistics and determined rapidly via LCP array monotonicity (single scan for cluster detection) (Prezza et al., 2018).
SNP/INDEL detection: By leveraging the positional clustering effect in eBWT and LCP arrays, small variants and sequencing errors can be detected without prior reference, and filtered via Poisson model expectations (Prezza et al., 2018).
Haplotype phasing, alternative splicing, error correction, assembly overlap detection: All these tasks benefit from consensus detection inside eBWT clusters, with the general workflow being eBWT+LCP construction, cluster delimitation by LCP local minima, filtering by cluster size/distribution, and local reconstruction (Prezza et al., 2018).
Indexing repetitive databases: State-of-the-art indexes (e.g., run-length FM-indexes, XBWTs) for highly repetitive sequence collections rely on the eBWT for compressibility and navigation efficiency (Olbrich, 27 Apr 2025, Gagie et al., 2018).
Tree and graph-based indexing: The generalized eBWT naturally extends to labeled trees (XBWT), Wheeler graphs, and advanced objects supporting efficient counting/locating/extracting (Gagie et al., 2018, Alanko et al., 2018).
Pattern matching on circular or structurally complex texts: The eBWT and its extensions (cBWT, XBWT) can support pattern match queries under generalized equivalences (e.g., Cartesian-tree matching), with succinct data structures and dynamic update capabilities (Osterkamp et al., 2024).
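The single-scan cluster detection mentioned for reference-free genomics can be illustrated with a simplified sketch. It assumes an LCP array is already available, with `lcp[i]` the longest common prefix between the suffixes of rank `i - 1` and `i`, and it delimits clusters with a fixed threshold `min_lcp` rather than the local-minima criterion and Poisson-based filtering described above. Both the function name `lcp_clusters` and the threshold parameter are assumptions made for this example.

```python
def lcp_clusters(lcp, min_lcp):
    """Return inclusive rank intervals whose adjacent LCP values are >= min_lcp."""
    clusters = []
    start = None
    for i in range(1, len(lcp)):
        if lcp[i] >= min_lcp:
            # Ranks i - 1 and i share a long enough prefix: open or extend.
            if start is None:
                start = i - 1
        elif start is not None:
            # The shared prefix became too short: close the current cluster.
            clusters.append((start, i - 1))
            start = None
    if start is not None:
        clusters.append((start, len(lcp) - 1))
    return clusters


print(lcp_clusters([0, 3, 4, 1, 0, 5, 2], min_lcp=3))  # [(0, 2), (4, 5)]
```

Each returned interval is a candidate cluster of eBWT positions whose left-context characters can then be inspected, for example to check whether more than one symbol occurs with sufficient frequency.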

6. Practical Performance and Benchmarks

Empirical studies demonstrate that grammar-based or Lyndon-grammar-based eBWT construction is feasible at the scale of human chromosomes or entire viral pangenomes, with significant memory and time benefits over classical suffix array-based BWT construction (Olbrich, 27 Apr 2025). For instance:

For 1000 human Chromosome 19 haplotypes: Lyndon grammar eBWT construction completes in 1444 s (2.0 GiB RAM), outperforming libsais, r-pfbwt, CMS-BWT, and ropebwt3.
For a collection of SARS-CoV-2 genomes: eBWT construction completes in 782 s, using 2.7 GiB.
Multithreading further reduces runtime to seconds for these datasets while keeping memory <10 GiB (Olbrich, 27 Apr 2025).

Run-count reductions (e.g., dollarless eBWT and XBWT) translate to full-text indexes up to 30% smaller with no loss in indexing functionality (Gagie et al., 2018).

7. Limitations, Open Problems, and Ongoing Developments

Fundamental limitations arise from the combinatorial explosion of possible decompositions: brute-force optimization of run counts is infeasible due to exponential growth in decomposition space (Ingels et al., 5 Jun 2025). Known decompositions such as the Lyndon factorization are not universally optimal with respect to RLE (Ingels et al., 5 Jun 2025). Practical heuristics to discover nearly minimal run decompositions are an open problem.

Tunneling and block-based compression techniques offer additional reductions in index size, especially on Wheeler graphs, but the design and implementation of optimal tunneling algorithms remain active areas of research (Alanko et al., 2018).

Extensions of eBWT-based data structures to dynamic, online, or partial indexing (e.g., incremental construction, dynamic insertions in cBWT) have been studied and come with succinct-space guarantees, at the cost of polylogarithmic slowdowns in updates and queries (Osterkamp et al., 2024).

The algebraic and combinatorial underpinnings of eBWT inversion, its connections to syntactic semigroups, and theoretical properties (e.g., de Bruijn word construction, factor enumeration) continue to enrich both the foundational and applied aspects of the eBWT framework (Higgins, 2019).