Dollarless eBWT: Sentinel-Free Transform

Updated 22 January 2026
  • Dollarless eBWT is a sentinel-free extended Burrows–Wheeler Transform that processes cyclic rotations under ω-order to enable efficient text indexing and genome analysis.
  • It employs a linear-time, SAIS-inspired algorithm using cyclic S/L typing, bucket-based induction, and recursive naming to sort rotations without added markers.
  • Its practical benefits include reduced run counts and improved compressibility, making it ideal for high-throughput genomic datasets and compressed full‐text indexing.

The dollarless eBWT (extended Burrows–Wheeler Transform) encompasses a class of transformations that extend or reconstruct Burrows–Wheeler-style text permutations, originally devised for string compression and indexing, but explicitly without the need for end-marker (sentinel, dollar "\$") characters. This approach has significant theoretical, algorithmic, and practical ramifications for text collections, compressed indexing, and genome analysis, providing both efficiency improvements and new guarantees over classical constructions.

1. Formal Definitions and Core Properties

The dollarless eBWT generalizes the classic BWT by omitting any special end-of-string character. Consider a multiset of primitive (non-periodic) strings $\mathcal{M} = \{T_1, \dots, T_m\}$ over a finite, ordered alphabet $\Sigma$, where $|T_i| = n_i$ and total length $N = \sum_i n_i$. For each $T_i$, all $n_i$ cyclic rotations are considered, and indices are treated modulo $n_i$. The generalized conjugate array is obtained by collecting all these cyclic rotations and sorting them under the so-called $\omega$-order, characterized by:

  • For $U, V \in \Sigma^*$, $U \prec_\omega V$ if (a) $U, V$ have the same primitive root and $U <_{\text{lex}} V$, or (b) $U^\omega <_{\text{lex}} V^\omega$ otherwise, with $U^\omega$ denoting infinite repetition of $U$.

The dollarless eBWT output string $B[1..N]$ is formed by reading, for each sorted rotation, the character preceding its starting position, with the positions of first-rotations also recorded. Notably, no sentinels or external markers are ever introduced at any step; rotations are entirely circular (Boucher et al., 2021).

For a single string $T$ (i.e., $m=1$), this process yields the circular BWT, coinciding with the classical BWT modulo rotation choice.

2. Algorithmic Construction: Linear-Time, Sentinel-Free Approaches

Dollarless eBWT can be computed in linear time via an adaptation of the SAIS (Suffix Array Induced Sorting) paradigm. The main steps are:

  • Cyclic S/L/LMS Typing: Each cyclic position is labelled as S-type or L-type according to whether its rotation precedes or follows its neighbor under $\omega$-order. LMS-positions are those S-positions whose predecessor is L-type.
  • Bucket-Based Induction: Rotations are bucketed by their first character. LMS-positions are first placed in their correct buckets, then multiple left-to-right and right-to-left passes induce L- and S-type orderings, respectively. This guarantees LMS-substrings are sorted correctly under $\omega$-order.
  • Recursion on Named Substrings: Distinct LMS-substrings are replaced by unique names, a shorter instance is constructed, and recursion proceeds until all names are distinct.
  • Final Induction: When recursion returns, positions are mapped back to the original array, and further induction fills in the complete rotation ordering. Finally, length-1 strings are trivially handled.

The entire procedure runs in $O(N)$ time and $O(N)$ space for $N$ total input length, and never requires appending a sentinel or computing Lyndon rotations. This construction underpins both single-string and multi-string (collection) eBWT algorithms (Boucher et al., 2021).

3. Combinatorics of Sentinel Insertion and Characterizations

Classical BWT is usually defined for strings $v\$$ (with appended sentinel), but not every string in $(\Sigma \cup \{\$\})^{n+1}$ is a BWT image. For a string $w \in \Sigma^n$ (without sentinel), a central combinatorial problem is: in which positions $i$ does insertion of a sentinel (between $w_{i-1}$ and $w_i$) make $w'$ a valid BWT-image? This is formalized as a "nice position".

Formally, for a permutation $\pi_w$ induced (by stable sorting) from $w$, define "pseudo-cycles" $S \subseteq \{1, \dots, n\}$, each split as $S_L < S_R$, such that $\pi_w(S) = (S_L - 1) \cup S_R$. The critical interval is $R(S) = [\max(S_L)+1, \min(S_R)]$, and the insertion positions $i$ lying outside $R(S)$ for all pseudo-cycles $S$ are exactly the "nice" ones, i.e., those yielding valid BWT images. All nice $i$ have the same parity, and a variety of combinatorial bounds on the number and location of such positions have been established (Giuliani et al., 2019).

This characterizes which strings $w$ can be viewed as BWT images of some $v$ with a sentinel. The answer depends entirely on the structure of $\pi_w$: its cycles and pseudo-cycles.

If at least one nice position exists, the original string can be recovered as the dollarless eBWT without explicit insertion of the sentinel: select any nice position, virtually insert the sentinel for computation, then remove it from the output to yield a fully compatible, invertible, and indexable transform.

4. Practical Algorithms: Construction and Inversion

In direct implementation, absent a dedicated sentinel-free construction algorithm, one can:

  • Concatenate the collection with sentinels, form the standard EBWT via the suffix array over $T = S_1\$S_2\$\dots S_t\$$,
  • Identify all sentinel positions in the output BWT and store them in a bitvector $E[1..N]$,
  • Remove all sentinels from the stored transform to obtain the dollarless eBWT $L$.

The pair $(L, E)$ provides a fully invertible representation: to invert, merge $L$ and the marker positions given by $E$ to reconstruct the full BWT, then perform standard LF-mapping to recover the original text or collection (Gagie et al., 2018). For each character abutting a removed sentinel in the BWT, adjacent runs may now coalesce, yielding a more compressible representation for downstream indexing or storage.

This practical approach enables $O(N)$ time and space construction and inversion, with the only overhead being the additional marker bitvector whose size is $O(t \log(N/t) + t)$ bits for $t$ strings.

For large collections or highly repetitive texts, prefix-free parsing (PFP) can serve as an effective scalable variant. Here, phrases beginning and ending at hash-based triggers are parsed, the dollarless eBWT of the parse is computed, then the overall transform is reconstructed blockwise. The method reduces both memory and time for very large genomics datasets (Boucher et al., 2021).

5. Compression and Indexing Benefits

Empirical studies show measurable benefits of omitting sentinels in practice. For example, on a human chromosome 19 readset (concatenated text with $N$ large), constructing the standard EBWT with one \$ per read yields roughly 220 million runs. Eliminating all \$'s from the output reduces the run count in $L$ by approximately 19%, from 220 million to 178 million (Gagie et al., 2018). This reduction arises as runs of real characters adjacent to sentinels can now merge, providing more favorable run-length encodings essential for compressed index structures.
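
The strip-and-merge idea and its effect on run counts can be illustrated on a toy collection. The sketch below is not the cited authors' implementation: it builds the BWT by naively sorting suffixes (quadratic time), orders identical sentinels by the suffixes that follow them (one convention among several), and uses a plain Python list for $E$ with one entry per BWT position. All function names are hypothetical.

```python
def bwt_with_sentinels(strings):
    """Naive BWT of T = S_1$S_2$...S_t$ via sorted suffixes (illustration only)."""
    text = "".join(s + "$" for s in strings)
    # '$' sorts before A, C, G, T in ASCII; identical sentinels are ordered by
    # the suffixes that follow them (an assumption of this sketch).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)  # text[-1] wraps to the final '$'


def strip_sentinels(bwt):
    """Split the BWT into the sentinel-free string L and the marker list E."""
    E = [ch == "$" for ch in bwt]
    L = "".join(ch for ch in bwt if ch != "$")
    return L, E


def merge_sentinels(L, E):
    """Inverse of strip_sentinels: re-insert '$' wherever E is set."""
    chars = iter(L)
    return "".join("$" if marked else next(chars) for marked in E)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


reads = ["ACGT", "ACGA", "CCGT"]  # toy stand-in for a read set
full_bwt = bwt_with_sentinels(reads)
L, E = strip_sentinels(full_bwt)
print(count_runs(full_bwt), count_runs(L))  # runs before and after stripping
```

Deleting characters from a string can never increase its run count, so `count_runs(L)` is at most `count_runs(full_bwt)`; how large the saving is depends on the data.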

Subsequent application of tree-based eXtended BWTs (XBWT) using a labeled tree based on the reference genome further compresses the representation. In the same setting, the XBWT reduced the run count by an additional 15%, to approximately 150 million runs, establishing its utility for highly repetitive, reference-driven collections (Gagie et al., 2018).

The dollarless eBWT enables the direct application of FM-index and other compressed full-text indexing strategies on the transformed text, as LF-mapping, rank, and select operations carry over unchanged. Asymptotic time and space complexities remain equivalent to the sentinel-based approaches.
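
As a rough illustration of how rank-based backward search applies to a sentinel-free transform, the sketch below builds the eBWT of a small collection by explicitly sorting all rotations on their infinite powers, then counts pattern occurrences. It assumes every input string is primitive, uses a linear-time `rank` in place of a real rank structure, and its function names are hypothetical. Because rotations are circular, a counted occurrence may wrap around the end of a string.

```python
from functools import cmp_to_key


def omega_cmp(u, v):
    """Compare u^omega with v^omega; |u| + |v| characters suffice to decide."""
    span = len(u) + len(v)
    a = (u * (span // len(u) + 1))[:span]
    b = (v * (span // len(v) + 1))[:span]
    return (a > b) - (a < b)


def naive_ebwt(strings):
    """eBWT of primitive strings by sorting every rotation (illustration only)."""
    rotations = []
    for s in strings:
        for i in range(len(s)):
            # (rotation starting at i, character cyclically preceding position i)
            rotations.append((s[i:] + s[:i], s[i - 1]))
    rotations.sort(key=cmp_to_key(lambda x, y: omega_cmp(x[0], y[0])))
    return "".join(prev for _, prev in rotations)


def count_occurrences(ebwt, pattern):
    """Backward search: rotations whose infinite power starts with pattern."""
    C, total = {}, 0
    for c in sorted(set(ebwt)):
        C[c] = total  # number of characters in ebwt smaller than c
        total += ebwt.count(c)

    def rank(c, i):
        return ebwt[:i].count(c)  # O(n) stand-in for a rank data structure

    lo, hi = 0, len(ebwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + rank(c, lo), C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


B = naive_ebwt(["GATTACA", "TAGACA", "ACGT"])
print(B, count_occurrences(B, "ACA"))
```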

6. Variants and Extensions: Large-Scale and Reference-Aware Approaches

The theoretical and algorithmic frameworks for dollarless eBWT motivate further optimizations:

  • Scalable Construction with PFP: Using prefix-free parsing with rolling hash triggers allows the algorithm to handle datasets well beyond main memory, leveraging bucketwise and blockwise merging for both transform computation and inversion (Boucher et al., 2021).
  • Reference-Aware Transform (XBWT): When a reference genome is available and reads are aligned, a labeled tree structure permits application of Ferragina et al.’s XBWT. This structure clusters sequences based on path-labels rather than only local suffix contexts, yielding further compressibility and supporting full-text indexed search (Gagie et al., 2018).

Both avenues benefit from the absence of sentinels: with no boundary symbols in the transform, runs that a sentinel would otherwise split can merge.
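
The parsing step behind the PFP variant can be sketched for a single circular string. This is a simplified illustration under stated assumptions, not the cited construction: the window length `w`, the modulus `p`, the polynomial hash (recomputed per window instead of rolled), and the fallback to position 0 when no trigger is found are all choices made for this example, and the later steps (eBWT of the parse, blockwise reconstruction) are omitted.

```python
def window_hash(window, base=257, mod=(1 << 61) - 1):
    """Polynomial hash of one window; a real implementation would roll it."""
    h = 0
    for ch in window:
        h = (h * base + ord(ch)) % mod
    return h


def circular_parse(text, w, p):
    """Split a circular string into overlapping phrases delimited by triggers."""
    n = len(text)

    def circ(start, length):
        return "".join(text[(start + k) % n] for k in range(length))

    # A position is a trigger when its length-w circular window hashes to 0 mod p.
    triggers = [i for i in range(n) if window_hash(circ(i, w)) % p == 0]
    if not triggers:
        triggers = [0]  # fallback for this sketch only

    phrases = []
    for j, a in enumerate(triggers):
        b = triggers[(j + 1) % len(triggers)]
        gap = (b - a) % n or n  # distance to the next trigger around the circle
        phrases.append(circ(a, gap + w))  # phrase ends with the next trigger window

    dictionary = sorted(set(phrases))
    parse = [dictionary.index(ph) for ph in phrases]
    return dictionary, parse


def reassemble(dictionary, parse, w):
    """Drop each phrase's trailing w-character overlap and concatenate."""
    return "".join(dictionary[k][:-w] for k in parse)


genome = "GATTACAGATTACAGATCACA"
dictionary, parse = circular_parse(genome, w=2, p=3)
rotation = reassemble(dictionary, parse, w=2)  # a rotation of genome
```

Consecutive phrases overlap by `w` characters, so dropping each phrase's last `w` characters and concatenating recovers a rotation of the input, starting at the first trigger position.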

7. Summary Table: Dollarless eBWT Construction Approaches

| Approach | Sentinel Needed? | Time Complexity | Memory Overhead |
|---|---|---|---|
| Direct SAIS-eBWT (Boucher et al., 2021) | No | $O(N)$ | $O(N)$ |
| SA/EBWT+strip (Gagie et al., 2018) | Yes (virtual) | $O(N)$ / $O(N \log N)$ | $O(N)$ text, $O(t \log(N/t))$ bits for $E$ |
| Prefix-Free Parsing | No | $O(N)$ | $O(|D| + N_p)$ |

Here, $N$ is total input length, $t$ is string count, $|D|$ is dictionary size, and $N_p$ is total phrase count.
