Dollarless eBWT is a sentinel-free extended Burrows–Wheeler Transform that processes cyclic rotations under ω-order to enable efficient text indexing and genome analysis.
It employs a linear-time, SAIS-inspired algorithm using cyclic S/L typing, bucket-based induction, and recursive naming to sort rotations without added markers.
Its practical benefits include reduced run counts and improved compressibility, making it ideal for high-throughput genomic datasets and compressed full-text indexing.
The dollarless eBWT (extended Burrows–Wheeler Transform) encompasses a class of transformations that extend or reconstruct Burrows–Wheeler-style text permutations, originally devised for string compression and indexing, but explicitly without the need for end-marker (sentinel, dollar "$") characters. This approach has significant theoretical, algorithmic, and practical ramifications for text collections, compressed indexing, and genome analysis, providing both efficiency improvements and new guarantees over classical constructions.</p>
<h2 class='paper-heading' id='formal-definitions-and-core-properties'>1. Formal Definitions and Core Properties</h2>
<p>The dollarless eBWT generalizes the classic BWT by omitting any special end-of-string character. Consider a multiset of primitive (non-periodic) strings $\mathcal M = \{T_1, \dots, T_m\}$ over a finite, ordered alphabet $\Sigma$, where $|T_i| = n_i$ and total length $N = \sum_i n_i$. For each $T_i$, all $n_i$ cyclic rotations are considered, and indices are treated modulo $n_i$. The generalized conjugate array is obtained by collecting all these cyclic rotations and sorting them under the so-called $\omega$-order, characterized by:</p>
<ul><li>For $U, V \in \Sigma^*$, $U \prec_\omega V$ if (a) $U, V$ have the same primitive root and $U <_{\text{lex}} V$, or (b) $U^\omega <_{\text{lex}} V^\omega$ otherwise, with $U^\omega$ denoting infinite repetition of $U$.</li></ul>
<p>The dollarless eBWT output string $B[1..N]$ is formed by reading, for each sorted rotation, the character preceding its starting position, with the positions of first-rotations also recorded. Notably, no sentinels or external markers are ever introduced at any step; rotations are entirely circular (<a href="/papers/2106.11191" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Boucher et al., 2021</a>).</p>
<p>For a single string $T$ (i.e., $m=1$), this process yields the circular BWT, coinciding with the classical BWT modulo rotation choice.</p>
<h2 class='paper-heading' id='algorithmic-construction-linear-time-sentinel-free-approaches'>2. Algorithmic Construction: Linear-Time, Sentinel-Free Approaches</h2>
<p>Dollarless eBWT can be computed in linear time via an adaptation of the SAIS (Suffix Array Induced Sorting) paradigm. The main steps are:</p>
<ul><li><strong>Cyclic S/L/LMS Typing:</strong> Each cyclic position is labelled as S-type or L-type according to whether its rotation precedes or follows its neighbor under $\omega$-order. LMS-positions are those S-positions whose predecessor is L-type.</li>
<li><strong>Bucket-Based Induction:</strong> Rotations are bucketed by their first character. LMS-positions are first placed in their correct buckets, then multiple left-to-right and right-to-left passes induce L- and S-type orderings, respectively. This guarantees LMS-substrings are sorted correctly under $\omega$-order.</li>
<li><strong>Recursion on Named Substrings:</strong> Distinct LMS-substrings are replaced by unique names, a shorter instance is constructed, and recursion proceeds until all names are distinct.</li>
<li><strong>Final Induction:</strong> When recursion returns, positions are mapped back to the original array, and further induction fills in the complete rotation ordering. Finally, length-1 strings are trivially handled.</li></ul>
<p>The entire procedure runs in $O(N)$ time and $O(N)$ space for $N$ total input length, and never requires appending a sentinel or computing Lyndon rotations. This construction underpins both single-string and multi-string (collection) eBWT algorithms (<a href="/papers/2106.11191" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Boucher et al., 2021</a>).</p>
<h2 class='paper-heading' id='combinatorics-of-sentinel-insertion-and-characterizations'>3. Combinatorics of Sentinel Insertion and Characterizations</h2>
<p>Classical BWT is usually defined for strings $v\$$ (with appended sentinel), but not every string over $(\Sigma \cup \{\$\})^{n+1}$ is a BWT image. For a string $w \in \Sigma^n$ (without sentinel), a central combinatorial problem is: in which positions $i$ does insertion of a sentinel (between $w_{i-1}$ and $w_i$) make $w'$ a valid BWT-image? This is formalized as a "nice position".</p>
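For short strings, the question of which insertion positions are nice can be checked by brute force. The sketch below does not implement the pseudo-cycle characterization of Giuliani et al.; it assumes the known fact that a string containing exactly one sentinel (smaller than every other symbol) is a BWT image exactly when its standard permutation, i.e. its LF-mapping, is a single cycle. The function names are illustrative only.

```python
def standard_permutation(s):
    """LF-mapping: position i -> rank of s[i] under a stable sort of s."""
    order = sorted(range(len(s)), key=lambda i: s[i])  # sorted() is stable
    lf = [0] * len(s)
    for rank, i in enumerate(order):
        lf[i] = rank
    return lf


def is_cyclic(perm):
    """True if the permutation consists of one cycle covering all positions."""
    steps, i = 0, 0
    while True:
        i = perm[i]
        steps += 1
        if i == 0:
            break
    return steps == len(perm)


def nice_positions(w, sentinel="$"):
    """1-based positions i such that inserting the sentinel between
    w[i-1] and w[i] (1-based) gives a valid BWT image."""
    # Assumption: the sentinel is absent from w and smaller than all symbols.
    assert sentinel not in w and all(sentinel < c for c in w)
    result = []
    for i in range(1, len(w) + 2):
        candidate = w[:i - 1] + sentinel + w[i - 1:]
        if is_cyclic(standard_permutation(candidate)):
            result.append(i)
    return result


# "annb$aa" is the BWT of "banana$", so position 5 is nice for "annbaa".
print(nice_positions("annbaa"))
```

Each candidate costs one sort, so the whole scan is roughly quadratic; it is meant for experimenting with small inputs, not as a replacement for the combinatorial characterization.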
<p>Formally, for a permutation $\pi_w$ induced (by stable sorting) from $w$, define "pseudo-cycles" $S \subseteq \{1, \dots, n\}$, each split as $S_L < S_R$, such that $\pi(S) = (S_L - 1) \cup S_R$. The critical interval is $R(S) = [\max(S_L)+1, \min(S_R)]$, and the insertion positions $i$ lying outside $R(S)$ for all pseudo-cycles $S$ are exactly the "nice" ones, i.e., those yielding valid BWT images. All nice $i$ have the same parity, and a variety of combinatorial bounds on the number and location of such positions have been established (<a href="/papers/1908.09125" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Giuliani et al., 2019</a>).</p>
<p>This characterizes which strings $w$ can be viewed as BWT images of some $v$ with a sentinel. The answer depends entirely on the structure of $\pi_w$: its cycles and pseudo-cycles.</p>
<p>If at least one nice position exists, the original string can be recovered as the dollarless eBWT without explicit insertion of the sentinel: select any nice position, virtually insert the sentinel for computation, then remove it from the output to yield a fully compatible, invertible, and indexable transform.</p>
<h2 class='paper-heading' id='practical-algorithms-construction-and-inversion'>4. Practical Algorithms: Construction and Inversion</h2>
<p>In direct implementation, absent a dedicated sentinel-free construction algorithm, one can:</p>
<ul><li>Concatenate the collection with sentinels, form the standard EBWT via the suffix array over $T=S_1\$S_2\$\dots S_t\$$,</li>
<li>Identify all sentinel positions in the output BWT and store them in a bitvector $E[1..N]$,</li>
<li>Remove all sentinels from the stored transform to obtain the dollarless eBWT $L$.</li></ul>
<p>The pair $(L, E)$ provides a fully invertible representation: to invert, merge $L$ and the marker positions given by $E$ to reconstruct the full BWT, then perform standard LF-mapping to recover the original text or collection (<a href="/papers/1809.07320" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Gagie et al., 2018</a>). For each character abutting a removed sentinel in the BWT, adjacent runs may now coalesce, yielding a more compressible representation for downstream indexing or storage.</p>
<p>This practical approach enables $O(N)$ time and space construction and inversion, with the only overhead being the additional marker bitvector whose size is $O(t \log(N/t) + t)$ bits for $t$ strings.</p>
<p>For large collections or highly repetitive texts, prefix-free parsing (PFP) can serve as an effective scalable variant. Here, phrases beginning and ending at hash-based triggers are parsed, the dollarless eBWT of the parse is computed, then the overall transform is reconstructed blockwise. The method reduces both memory and time for very large genomics datasets (<a href="/papers/2106.11191" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Boucher et al., 2021</a>).</p>
<h2 class='paper-heading' id='compression-and-indexing-benefits'>5. Compression and Indexing Benefits</h2>
<p>Empirical studies show measurable benefits of omitting sentinels in practice. For example, on a human chromosome 19 read set (concatenated text with $N$ large), constructing the standard EBWT with one \$ per read yields roughly 220 million runs. Eliminating all \$’s from the output reduces the run count in $L$ by approximately 19%, from 220 million to 178 million (Gagie et al., 2018). This reduction arises as runs of real characters adjacent to sentinels can now merge, providing more favorable run-length encodings essential for compressed index structures.</p>
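The sentinel-removal step and its effect on run counts can be illustrated on a toy collection. The following is a minimal sketch, not the implementation of Gagie et al.: it builds the BWT by naively sorting suffixes (quadratic, for illustration only), and the helper names and the three example reads are made up. It covers only the stripping and merging of sentinels; the LF-mapping inversion of the merged BWT is not shown.

```python
def bwt_with_sentinels(strings, sentinel="$"):
    """BWT of S_1$S_2$...S_t$ via a naive suffix sort."""
    text = "".join(s + sentinel for s in strings)
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] wraps around to the final sentinel when i == 0.
    return "".join(text[i - 1] for i in sa)


def strip_sentinels(bwt, sentinel="$"):
    """Split the BWT into the dollarless string L and the marker bitvector E."""
    E = [1 if c == sentinel else 0 for c in bwt]
    L = "".join(c for c in bwt if c != sentinel)
    return L, E


def merge(L, E, sentinel="$"):
    """Re-insert sentinels at the positions marked in E."""
    chars = iter(L)
    return "".join(sentinel if bit else next(chars) for bit in E)


def count_runs(s):
    """Number of maximal runs of equal characters."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


reads = ["GATTACA", "GATTAGA", "CATTACA"]   # hypothetical toy reads
full = bwt_with_sentinels(reads)
L, E = strip_sentinels(full)
assert merge(L, E) == full                  # (L, E) loses no information
print(count_runs(full), count_runs(L))      # run counts before and after
```

Removing a maximal run of sentinels always lowers the run count by at least one, and by two when the characters on either side are equal and their runs coalesce. For instance, `aab$bba` has 5 runs, while `aabbba` has 3.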
Subsequent application of tree-based eXtended BWTs (XBWT) using a labeled tree based on the reference genome further compresses the representation. In the same setting, the XBWT reduced the run count by an additional 15%, to approximately 150 million runs, establishing its utility for highly repetitive, reference-driven collections (Gagie et al., 2018).
The dollarless eBWT enables the direct application of FM-index and other compressed full-text indexing strategies on the transformed text, as LF-mapping, rank, and select operations carry over unchanged. Asymptotic time and space complexities remain equivalent to the sentinel-based approaches.
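To make the claim about LF-mapping concrete, here is a small sketch that computes the dollarless eBWT by directly sorting rotations under ω-order and then inverts it by following the cycles of the LF-mapping. It is a naive quadratic illustration, not the linear-time SAIS-style construction described earlier, and it assumes all input strings are primitive. Comparing the first 2·max length characters of the infinite repetitions is enough to decide ω-order between any two rotations. The function names are illustrative only.

```python
def ebwt(strings):
    """Naive dollarless eBWT: sort all cyclic rotations under omega-order."""
    width = 2 * max(len(s) for s in strings)  # >= |U| + |V| for any pair
    rows = []
    for s in strings:
        n = len(s)
        for i in range(n):
            rot = s[i:] + s[:i]
            # Prefix of rot^omega that is long enough to decide comparisons.
            key = (rot * (width // n + 1))[:width]
            # s[i - 1] is the cyclic predecessor (wraps to s[-1] when i == 0).
            rows.append((key, n, s[i - 1]))
    rows.sort(key=lambda r: (r[0], r[1]))
    return "".join(prev for _, _, prev in rows)


def invert_ebwt(B):
    """Recover one rotation of each input string from the cycles of LF."""
    order = sorted(range(len(B)), key=lambda i: B[i])  # stable sort
    lf = [0] * len(B)
    for rank, i in enumerate(order):
        lf[i] = rank
    seen = [False] * len(B)
    recovered = []
    for start in range(len(B)):
        if seen[start]:
            continue
        chars, r = [], start
        while not seen[r]:          # walk one LF cycle, reading backwards
            seen[r] = True
            chars.append(B[r])
            r = lf[r]
        recovered.append("".join(reversed(chars)))
    return recovered


print(ebwt(["banana"]))        # nnbaaa
print(ebwt(["ab", "abb"]))     # bbaba
print(invert_ebwt("bbaba"))    # ['ab', 'abb']
```

Each LF cycle corresponds to one input string, which is recovered only up to rotation (here as its smallest rotation); the recorded first-rotation positions mentioned in Section 1 are what would pin down the original starting points.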
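<h2 class='paper-heading' id='variants-and-extensions-large-scale-and-reference-aware-approaches'>6. Variants and Extensions: Large-Scale and Reference-Aware Approaches</h2>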
The theoretical and algorithmic frameworks for dollarless eBWT motivate further optimizations:
Scalable Construction with PFP: Using prefix-free parsing with rolling hash triggers allows the algorithm to handle datasets well beyond main memory, leveraging bucketwise and blockwise merging for both transform computation and inversion (Boucher et al., 2021).
Reference-Aware Transform (XBWT): When a reference genome is available and reads are aligned, a labeled tree structure permits application of Ferragina et al.’s XBWT. This structure clusters sequences based on path-labels rather than only local suffix contexts, yielding further compressibility and supporting full-text indexed search (Gagie et al., 2018).
Both avenues benefit from the absence of sentinels, as they mitigate redundant boundary symbols and enable unification of adjacent runs.
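The parsing step behind the PFP variant can be sketched as follows. This is a simplified, linear (non-circular) illustration under stated assumptions: the window size `w`, modulus `p`, hash function, and function names are all chosen for the example, the start and end of the text act as implicit phrase boundaries, and the window hash is recomputed from scratch where a real implementation would use a rolling update. The published method works on circular strings and goes on to build the eBWT from the dictionary and the parse, which is not shown here.

```python
def window_hash(window, base=256, mod=1_000_000_007):
    """Polynomial hash of one window (a rolling hash would update this in O(1))."""
    h = 0
    for c in window:
        h = (h * base + ord(c)) % mod
    return h


def prefix_free_parse(text, w=4, p=11):
    """Cut text into phrases that start and end at trigger windows.

    Consecutive phrases overlap by exactly w characters.
    Returns (dictionary, parse): sorted distinct phrases and phrase ranks.
    """
    phrases, start = [], 0
    for i in range(1, len(text) - w + 1):
        if window_hash(text[i:i + w]) % p == 0:   # text[i:i+w] is a trigger
            phrases.append(text[start:i + w])     # phrase ends with the trigger
            start = i                             # next phrase starts with it
    phrases.append(text[start:])
    dictionary = sorted(set(phrases))
    rank = {ph: k for k, ph in enumerate(dictionary)}
    return dictionary, [rank[ph] for ph in phrases]


def reconstruct(dictionary, parse, w=4):
    """Undo the parse by dropping the w-character overlap of later phrases."""
    phrases = [dictionary[k] for k in parse]
    return phrases[0] + "".join(ph[w:] for ph in phrases[1:])


text = "GATTACAGATTACAGATTAGACATTACA" * 3   # hypothetical repetitive input
dictionary, parse = prefix_free_parse(text)
assert reconstruct(dictionary, parse) == text
print(len(dictionary), len(parse))
```

On repetitive inputs the same phrases recur, so the dictionary tends to stay much smaller than the text while the parse becomes a short sequence of phrase ranks; that is the property the scalable construction relies on.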
<h2 class='paper-heading' id='summary-table-dollarless-ebwt-construction-approaches'>7. Summary Table: Dollarless eBWT Construction Approaches</h2>