The extended Burrows-Wheeler Transform (eBWT) is a generalization of the classical BWT that processes collections of strings, graphs, or combinatorial structures for advanced indexing and compression.
Algorithmic strategies such as GSA-based methods, suffix array adaptations, and Lyndon grammar approaches achieve efficient eBWT construction with linear or near-linear time complexity for large datasets.
Practical applications include reference‐free genomic analysis, pattern matching, and compressed indexing of highly repetitive data, while open challenges remain in optimal run decomposition and dynamic updates.
The extended Burrows–Wheeler Transform (eBWT) generalizes the classical Burrows–Wheeler Transform from single-string input to collections or multisets of strings, graphs, or even more general combinatorial structures. This extension preserves and amplifies many of the combinatorial, algebraic, and algorithmic properties that make the original transform central in text indexing, compression, and bioinformatics, especially in settings where data repetitiveness and structural complexity are paramount. The eBWT underlies a wide spectrum of advanced data structures, including those for reference-free genomic analysis, suffix and Wheeler graph-based indexes, and compressed representations of highly repetitive datasets.
1. Mathematical Foundations and Formal Definitions
The eBWT for a multiset of strings $M=\{S_1,S_2,\ldots,S_k\}$ over a totally ordered alphabet $\Sigma$ is constructed by considering all conjugates (cyclic rotations) of the strings in $M$, forming a collection $C$, and sorting $C$ in infinite periodic order $<_\infty$, whereby $U <_\infty V$ iff $U^\infty$ is lexicographically smaller than $V^\infty$. The transform $\mathrm{eBWT}(M)$ is the sequence of characters that precede each sorted conjugate in this order, yielding a string of length $N=\sum_{S\in M}|S|$ (Olbrich, 27 Apr 2025, Higgins, 2019, Ingels et al., 5 Jun 2025).
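The definition above can be sketched directly in Python. This is a naive quadratic illustration, not one of the linear-time constructions cited in this article; the function name `ebwt` and the helper `omega_key` are hypothetical. It relies on the standard fact that comparing two infinite powers $U^\infty$ and $V^\infty$ only requires their first $|U|+|V|$ characters, so a common prefix length of twice the longest string suffices.

```python
def ebwt(strings):
    """Naive eBWT sketch: sort all conjugates in infinite periodic order
    and output the character that cyclically precedes each one."""
    if not strings or any(len(s) == 0 for s in strings):
        raise ValueError("expected a non-empty collection of non-empty strings")

    # Prefixes of length 2 * (longest string) of the infinite powers are
    # enough to decide the order of any two conjugates.
    limit = 2 * max(len(s) for s in strings)

    def omega_key(conj):
        reps = -(-limit // len(conj))  # ceiling division
        return (conj * reps)[:limit]

    # All cyclic rotations of all input strings.
    conjugates = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    conjugates.sort(key=omega_key)

    # The character preceding a rotation is the last character of that rotation.
    return "".join(c[-1] for c in conjugates)


print(ebwt(["ab", "aba"]))  # babaa
print(ebwt(["banana"]))     # nnbaaa
```

In the first example the conjugate `aba` sorts before `ab`, because `abaaba...` is smaller than `ababab...`; plain lexicographic order would place them the other way round.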
For string collections, the framework can also be formulated via the generalized suffix array (GSA) and the multi-string BWT:
$\mathrm{GSA}[i]=(j,r)$ indicates that the $i$-th smallest suffix (over all augmented strings $S_r\$_r$) starts at position $j$ in $S_r$.
$\mathrm{eBWT}[i]$ is $S_r[j-1]$ if $j>1$ and $\$_r$ if $j=1$ (Prezza et al., 2018).
The construction extends smoothly to more general objects. On Wheeler graphs, the eBWT is the string of edge labels in Wheeler order (Alanko et al., 2018), and further generalizations (e.g., tree-structured input, see XBWT) are available for labeled trees (Gagie et al., 2018).
2. Algorithmic Construction and Complexity
Multiple algorithmic approaches exist for eBWT construction across different settings:
GSA-based: For a collection of $m$ strings, compute the GSA and corresponding LCP array in $O(P)$ time and space, with $P = \sum_i |S_i|$ (Prezza et al., 2018).
Suffix array adaptation: For primitive factors, one can canonicalize each input string to its minimal conjugate, sort them, concatenate, and compute the bijective BWT (BBWT) by constructing the circular suffix array (CSA) under $<_\infty$, allowing linear-time construction via an SA-IS extension (Bannai et al., 2019).
Lyndon grammar approach: For large, repetitive datasets, grammar compression is leveraged by building Lyndon straight-line programs (SLPs) representing the input corpus. Construction involves three phases: grammar construction (Duval's algorithm streaming right-to-left), lexicographic sorting of nonterminals (via first-symbol forests), and eBWT extraction by enumerating conjugates in infinite periodic order. Time and space can be brought to $O(N+g)$ when the grammar size $g$ is much smaller than the input length $N$, and practical implementations can further run-length encode the BWT for additional compression (Olbrich, 27 Apr 2025).
Dollarless and tree-based: For aligned read sets, eliminating separator symbols lowers the run count. The XBWT model removes explicit separator symbols by leveraging the structure of a labeled tree that encodes reads and their alignments, with sorting by upward-path labels replacing suffix arrays. The overall cost becomes $O(m \log m)$, where $m$ is the number of edges (Gagie et al., 2018).
Parallelization (e.g., thread-parallel Lyndon parsing), hash-consing of grammar rules, and hybrid in-memory/external-memory schemes enable scalability to datasets comprising tens of billions of characters (Olbrich, 27 Apr 2025).
3. Structural and Algebraic Properties
The eBWT is bijective on its combinatorial domain (e.g., primitive necklaces, multisets of strings) and supports invertibility akin to the original BWT. The inverse is performed by reconstructing the standard permutation associated to the output via order-preserving partial maps between the sorted and original output positions, followed by decomposing the permutation into its cycles, each corresponding to a primitive in the original input (Higgins, 2019). This cycle-based inversion generalizes classical BWT inversion and is practical in $O(N\log k)$ time ($N$ output length, $k$ alphabet size).
The algebraic perspective connects the induced partial maps to syntactic semigroups of cyclic languages. Notably, for primitive $u$, the isomorphism $S(u) \cong S_u$ links the eBWT's induced semigroup to language-theoretic constructs (Higgins, 2019). Combination with Lyndon factorizations and bijections facilitates combinatorial enumeration of special objects (e.g., de Bruijn words).
For tree- and graph-based models, the eBWT generalizes to Wheeler graphs, where the ordering constraints on vertices and edges yield efficient rank/select-enabled navigation and ensure correctness of FM-index operations after tunneling and other nontrivial transformations (Alanko et al., 2018).
4. Run Count, Decomposition, and RLE Compressibility
Run-length encoding (RLE) size is determined by the number of runs, defined as the number of maximal constant blocks in the eBWT output (Ingels et al., 5 Jun 2025, Gagie et al., 2018). The choice of decomposition into substrings or tree/graph paths is pivotal:
The number of decompositions of a word $w$ into factors of length greater than $k$ is exponential in $|w|$ (generalized Fibonacci growth). There exists a universal upper bound for the minimum run count among $k$-constrained decompositions, and for certain words, the ratio between maximal and minimal run counts across all decompositions is unbounded (Ingels et al., 5 Jun 2025).
For genomic readsets, omitting end-markers from the eBWT (dollarless variant) reduces the number of runs by nearly 19%, with tree-based XBWT lowering the run count by a further 15% (Gagie et al., 2018). Lower run counts directly improve space usage in run-length FM-indexes and accelerate rank/select queries.
Tunneling techniques, originally for repetitive datasets, can be applied to Wheeler graphs after eBWT construction, further compressing repeated isomorphic blocks beyond what RLE can achieve (Alanko et al., 2018).
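To make the dependence of the run count on the chosen decomposition concrete, the following brute-force sketch enumerates every split of a short word into consecutive factors and reports the smallest and largest eBWT run counts. It assumes the length constraint means "each factor has at least `min_len` characters", which may differ from the exact convention in Ingels et al.; all function names are hypothetical, and the naive `ebwt` helper is only suitable for toy inputs.

```python
from itertools import groupby


def ebwt(strings):
    """Naive eBWT: sort all rotations by their infinite powers."""
    limit = 2 * max(len(s) for s in strings)

    def omega_key(conj):
        return (conj * (-(-limit // len(conj))))[:limit]

    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rotations.sort(key=omega_key)
    return "".join(r[-1] for r in rotations)


def run_count(text):
    """Number of maximal blocks of equal characters."""
    return sum(1 for _ in groupby(text))


def decompositions(word, min_len=1):
    """Yield every split of word into consecutive factors of length >= min_len."""
    if not word:
        yield []
        return
    for cut in range(min_len, len(word) + 1):
        for rest in decompositions(word[cut:], min_len):
            yield [word[:cut]] + rest


def run_count_range(word, min_len=1):
    """Smallest and largest eBWT run count over all admissible decompositions."""
    counts = [run_count(ebwt(d)) for d in decompositions(word, min_len)]
    return min(counts), max(counts)


if __name__ == "__main__":
    word = "banana"
    print(sum(1 for _ in decompositions(word, 1)), "decompositions")
    print("min/max runs:", run_count_range(word, 1))
```

With `min_len=1` a word of length $n$ has $2^{n-1}$ decompositions, so this exhaustive search is only feasible for very short words, which is the infeasibility the text refers to.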
5. Applications Across Bioinformatics, Indexing, and Pattern Matching
The eBWT paradigm is exploited across a spectrum of computational biology and text retrieval tasks:
Reference-free/Alignment-free genomics: eBWT supports clustering of reads covering the same genomic locus, with the expected cluster size and boundaries governed by Poisson statistics and determined rapidly via LCP array monotonicity (single scan for cluster detection) (Prezza et al., 2018).
SNP/INDEL detection: By leveraging the positional clustering effect in eBWT and LCP arrays, small variants and sequencing errors can be detected without prior reference, and filtered via Poisson model expectations (Prezza et al., 2018).
Haplotype phasing, alternative splicing, error correction, assembly overlap detection: All these tasks benefit from consensus detection inside eBWT clusters, with the general workflow being eBWT+LCP construction, cluster delimitation by LCP local minima, filtering by cluster size/distribution, and local reconstruction (Prezza et al., 2018).
Indexing repetitive databases: State-of-the-art indexes (e.g., run-length FM-indexes, XBWTs) for highly repetitive sequence collections rely on the eBWT for compressibility and navigation efficiency (Olbrich, 27 Apr 2025, Gagie et al., 2018).
Tree and graph-based indexing: The generalized eBWT naturally extends to labeled trees (XBWT), Wheeler graphs, and advanced objects supporting efficient counting/locating/extracting (Gagie et al., 2018, Alanko et al., 2018).
Pattern matching on circular or structurally complex texts: The eBWT and its extensions (cBWT, XBWT) can support pattern match queries under generalized equivalences (e.g., Cartesian-tree matching), with succinct data structures and dynamic update capabilities (Osterkamp et al., 2024).
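The clustering workflow mentioned in the items above (cluster delimitation on the LCP array, then filtering by size, then inspecting the eBWT symbols of each cluster) can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Prezza et al.: it assumes `lcp[i]` holds the longest common prefix between the sorted suffixes at ranks `i-1` and `i`, it places a boundary wherever the LCP drops below a threshold or reaches a strict local minimum, and it omits the Poisson-based filtering entirely. The names `lcp_clusters`, `cluster_symbols`, `min_lcp`, and `min_size` are hypothetical.

```python
from collections import Counter


def lcp_clusters(lcp, min_lcp, min_size=2):
    """Return half-open (start, end) rank ranges that form clusters.

    A boundary is placed before rank i when lcp[i] < min_lcp or when
    lcp[i] is a strict local minimum of the LCP array.
    """
    n = len(lcp)

    def is_local_min(i):
        return 0 < i < n - 1 and lcp[i] < lcp[i - 1] and lcp[i] < lcp[i + 1]

    clusters = []
    start = 0
    for i in range(1, n + 1):
        boundary = i == n or lcp[i] < min_lcp or is_local_min(i)
        if boundary:
            if i - start >= min_size:  # drop clusters that are too small
                clusters.append((start, i))
            start = i
    return clusters


def cluster_symbols(bwt, clusters):
    """Count the eBWT symbols inside each cluster (e.g. to spot a consensus)."""
    return [Counter(bwt[s:e]) for s, e in clusters]


lcp = [0, 5, 6, 1, 4, 4, 2, 7]
clusters = lcp_clusters(lcp, min_lcp=3)
print(clusters)  # [(0, 3), (3, 6), (6, 8)]
print(cluster_symbols("AACAGGTT", clusters))
```

A cluster whose symbol counts are split between two letters (such as `A` and `C` in the first range of the toy input) is the kind of position a variant caller would examine further.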
6. Practical Performance and Benchmarks
Empirical studies demonstrate that grammar-based or Lyndon-grammar-based eBWT construction is feasible at the scale of human chromosomes or entire viral pangenomes, with significant memory and time benefits over classical suffix array-based BWT construction (Olbrich, 27 Apr 2025). For instance:
For 1000 human Chromosome 19 haplotypes (total $\approx 6\times 10^{10}$ bp): Lyndon grammar eBWT completes in 1444 s (2.0 GiB RAM), outperforming libsais, r-pfbwt, CMS-BWT, and ropebwt3.
For $10^6$ SARS-CoV-2 genomes ($\approx 3\times 10^{10}$ bp): eBWT in 782 s, using 2.7 GiB.
Multithreading further reduces runtime to seconds for these datasets while keeping memory <10 GiB (Olbrich, 27 Apr 2025).
Run-count reductions (e.g., dollarless eBWT and XBWT) translate to full-text indexes up to 30% smaller with no loss in indexing functionality (Gagie et al., 2018).
7. Limitations, Open Problems, and Ongoing Developments
Fundamental limitations arise from the combinatorial explosion of possible decompositions: brute-force optimization of run counts is infeasible due to exponential growth in decomposition space (Ingels et al., 5 Jun 2025). Known decompositions such as the Lyndon factorization are not universally optimal with respect to RLE (Ingels et al., 5 Jun 2025). Practical heuristics to discover nearly minimal run decompositions are an open problem.
Tunneling and block-based compression techniques offer additional reductions in index size, especially on Wheeler graphs, but the design and implementation of optimal tunneling algorithms remain active areas of research (Alanko et al., 2018).
Dynamic, online, and partial-indexing extensions of eBWT-based data structures (e.g., incremental construction, dynamic insertions in cBWT) have been studied; they retain succinct space guarantees but incur polylogarithmic slowdowns on updates and queries (Osterkamp et al., 2024).
The algebraic and combinatorial underpinnings of eBWT inversion, its connections to syntactic semigroups, and theoretical properties (e.g., de Bruijn word construction, factor enumeration) continue to enrich both the foundational and applied aspects of the eBWT framework (Higgins, 2019).
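The cycle-based inversion described in Section 3 can be illustrated with a short sketch. It builds the standard permutation by stably sorting the positions of the eBWT output by character and then reads one word per cycle. The function name `inverse_ebwt` is hypothetical, and Python's built-in stable sort is used here as a stand-in for the counting-style pass a tuned implementation would use. Because the transform only determines each string up to rotation, the sketch returns one rotation per input string, namely the one with the smallest sorted rank in its cycle.

```python
def inverse_ebwt(last_column):
    """Recover one rotation of each input string from an eBWT output."""
    n = len(last_column)

    # Standard permutation: the r-th occurrence of a character in the output
    # maps to the r-th occurrence of that character in the sorted output.
    order = sorted(range(n), key=lambda i: last_column[i])  # stable sort
    lf = [0] * n
    for rank, pos in enumerate(order):
        lf[pos] = rank

    seen = [False] * n
    words = []
    for start in range(n):
        if seen[start]:
            continue
        # Following the permutation from `start` visits the characters of
        # one word from last to first.
        chars = []
        i = start
        while not seen[i]:
            seen[i] = True
            chars.append(last_column[i])
            i = lf[i]
        words.append("".join(reversed(chars)))
    return words


print(inverse_ebwt("babaa"))   # ['aab', 'ab']
print(inverse_ebwt("nnbaaa"))  # ['abanan']
```

Here `babaa` is the eBWT of the multiset {`ab`, `aba`}, and the recovered `aab` is a rotation of `aba`; likewise `abanan` is a rotation of `banana`.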