In-Place BWT Algorithm

Updated 31 December 2025

In-place BWT is an algorithm that computes the Burrows–Wheeler Transform by continually overwriting the input buffer with only a few auxiliary variables.
It employs lexicographic insertion and controlled cyclic shifts to build suffix-related structures sequentially, operating in constant extra space at the cost of quadratic runtime.
Recent extensions include in-place LCP and Lyndon array computations as well as hardware pipeline implementations that achieve fixed-cycle throughput for constrained memory systems.

The in-place Burrows–Wheeler Transform (BWT) algorithm is a class of procedures that compute the BWT of a string by continuously overwriting the input buffer, maintaining only a constant number of auxiliary variables. Starting with Crochemore et al.’s foundational method, recent research has extended the technique to construct additional suffix-related arrays in constant workspace and provided hardware implementations with fixed-cycle throughput. These algorithms have also facilitated direct in-place computation of Lyndon-related structures and bijective variants. Despite quadratic runtime, in-place BWT and its extensions remain central to time/space tradeoffs in suffix processing for both software and hardware applications.

1. Algorithmic Foundations and Lexicographic Insertion

The original in-place BWT algorithm of Crochemore et al. operates by incrementally constructing the BWT of progressively longer suffixes $T[s..n-1]$ of the input $T = T_0 T_1 \ldots T_{n-2} \$$. For each index $s$ descending from $n-2$ to $0$, the algorithm locates the end marker $\$$ in the current suffix buffer, which holds the BWT of $T[s+1..n-1]$, determines the rank $r$ of $T_s$ among all suffixes $T_s, T_{s+1}, \ldots, T_{n-1}$, and performs a controlled cyclic shift to integrate $T[s]$ at the correct lexicographic position.

The rank $r$ is given by:

$r = |\{\, j : s < j \le n-1,\ T[j] < c \,\}| + |\{\, j : s < j < p,\ T[j] = c \,\}|$

where $c = T[s]$, $T[j]$ denotes the current buffer content, and $p$ is the location of $\$$ in the BWT of $T[s+1..n-1]$. The end marker at position $p$ is then overwritten by $c$, the cells $T[s+1..s+r]$ are shifted one position to the left, and $\$$ is written at position $s+r$.

This ensures the BWT is built by a series of local, in-place updates, preserving a lexicographically sorted suffix ordering with only $O(1)$ extra integer variables (indices, counters, loop variables) (Louza et al., 24 Dec 2025).
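The rank computation for a single insertion step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes the counting rule described above (symbols smaller than the new character, plus equal symbols located before the end marker), a buffer held as a list of characters, and an end marker `$` that compares smaller than every other symbol. The function name `insertion_rank` is hypothetical.

```python
def insertion_rank(buf, s):
    """Return (p, r) for the insertion of suffix s.

    Assumes buf[s+1:] already holds the BWT of the suffix starting at
    s + 1, including the end marker '$' (assumed smaller than all symbols).
    """
    c = buf[s]
    p = buf.index("$", s + 1)  # location of the end marker in the buffer
    # Suffixes whose first symbol is smaller than c.
    smaller = sum(1 for j in range(s + 1, len(buf)) if buf[j] < c)
    # Suffixes starting with c that sort before the new suffix.
    equal_before = sum(1 for j in range(s + 1, p) if buf[j] == c)
    return p, smaller + equal_before
```

For the text `banana$`, the buffer reads `banan$a` just before the step `s = 2`, and `insertion_rank(list("banan$a"), 2)` returns `(5, 4)`: the end marker sits at position 5 and the new suffix `nana$` has rank 4 among the five suffixes processed so far.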

2. Pseudocode, Buffer Invariants, and Rank Maintenance

Each iteration $s$ consists of four steps: read $c = T[s]$, locate the position $p$ of the end marker in $T[s+1..n-1]$, compute the rank $r$ with one counting scan, and finally overwrite $T[p]$ with $c$ and shift the buffer so that the end marker lands at its new rank. After each step, the suffix buffer $T[s..n-1]$ holds the BWT of the suffix $T[s..n-1]$. If the inverse suffix array (ISA) computation is enabled, lexicographic ranks are incrementally updated in-place: any rank $\mathrm{ISA}[j] \ge r$ with $j > s$ becomes $\mathrm{ISA}[j] + 1$, while $\mathrm{ISA}[s]$ is assigned $r$ (Louza et al., 24 Dec 2025).
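A runnable sketch of the whole loop is given below. It is an illustration under stated assumptions rather than the published routine: the buffer is a Python list of characters, `$` occurs once at the end and compares smaller than every other symbol, and the optional ISA is kept in a separate output list (the name `bwt_in_place` and the `want_isa` flag are hypothetical).

```python
def bwt_in_place(buf, want_isa=False):
    """Overwrite buf (a list of symbols ending in '$') with its BWT.

    Assumption: '$' occurs once, at the end, and is the smallest symbol.
    Returns the ISA as a list when want_isa is set, otherwise None.
    """
    n = len(buf)
    isa = [0] * n if want_isa else None  # output array in this sketch
    for s in range(n - 2, -1, -1):
        c = buf[s]
        # Locate the end marker inside the BWT of the suffix starting at s + 1.
        p = s + 1
        while buf[p] != "$":
            p += 1
        # Rank of the new suffix among the suffixes starting at s..n-1.
        r = 0
        for j in range(s + 1, n):
            if buf[j] < c or (buf[j] == c and j < p):
                r += 1
        buf[p] = c                    # c now precedes the suffix at s + 1
        for j in range(s, s + r):     # shift buf[s+1..s+r] one cell left
            buf[j] = buf[j + 1]
        buf[s + r] = "$"              # end marker precedes the new suffix
        if want_isa:
            for j in range(s + 1, n):  # ranks at or above r move up by one
                if isa[j] >= r:
                    isa[j] += 1
            isa[s] = r
    return isa
```

Calling `bwt_in_place(buf, want_isa=True)` on `buf = list("banana$")` leaves `annb$aa` in `buf` and returns `[4, 3, 6, 2, 5, 1, 0]`. Apart from the optional output list, the function uses only a handful of scalar variables, which mirrors the constant-workspace claim.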

3. Space/Time Bounds and Extensions to LCP, Lyndon, and Bijective Transforms

In-place BWT runs in $O(n^2)$ time: each of $n$ iterations scans and shifts $O(n)$ elements. No auxiliary arrays or stacks are allocated; only $O(1)$ extra variables are used.

Several extensions have been realized in the same framework:

LCP array construction: By adding two scans per suffix insertion, the longest common prefix (LCP) values are computed and shifted in constant extra space along with the BWT. Elias δ-coding enables further in-place compression of the LCP array (Louza et al., 2016).
Lyndon array computation: After building the ISA, the Lyndon array is produced in-place by performing a next-smaller-value (NSV) scan: for each $i$, $\mathrm{LA}[i] = \min\{\, j > i : \mathrm{ISA}[j] < \mathrm{ISA}[i] \,\} - i$, taking $j = n$ when no such position exists. This is implemented via a double loop overwriting ISA values, also in $O(n^2)$ time (Louza et al., 24 Dec 2025).
Bijective BWT and conversions: The in-place paradigm extends to the bijective BWT (BBWT), leveraging Duval’s Lyndon factorization. Factor-by-factor insertion maintains the order of the extended BWT (EBWT); in-place inversion or conversion between BWT and BBWT is also quadratic in time, using only constant extra workspace (Köppl et al., 2020).
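The NSV scan used for the Lyndon array can be sketched as a double loop that overwrites the ISA from left to right. This is a minimal illustration assuming a 0-based ISA of a text that ends with a unique smallest end marker; the function name `lyndon_array_from_isa` is hypothetical and the loop structure is not claimed to match the published routine line by line.

```python
def lyndon_array_from_isa(isa):
    """Overwrite isa with the Lyndon array via a next-smaller-value scan.

    Position i is rewritten only after the scan to its right has finished,
    so every entry read at j > i is still an ISA value.
    """
    n = len(isa)
    for i in range(n):
        j = i + 1
        while j < n and isa[j] > isa[i]:
            j += 1
        isa[i] = j - i  # distance to the next smaller rank (or to n)
    return isa
```

For `banana$`, whose ISA is `[4, 3, 6, 2, 5, 1, 0]`, the result is `[1, 2, 1, 2, 1, 1, 1]`: for example, the longest Lyndon word starting at position 1 is `an`.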

4. Hardware Pipeline Realizations

The in-place BWT algorithm lends itself to hardware accelerators due to its regular update pattern and fixed workspace. In a register-based scanchain architecture, each input block is held in a sequence of flip-flops. On each iteration, the chain is shifted, a new character is loaded, and the insertion rank is computed using parallel comparators. Population counts of the “$<$” and “$=$” comparator flags determine the new insertion index; updates are distributed over a fixed pipeline of six clock cycles per character, yielding input-independent, constant-latency execution.
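The comparator-and-popcount idea can be modelled in software with integer bitmasks. The sketch below is only a behavioural model and does not reproduce the published circuit: the names `rank_from_flags`, `lt_mask` and `eq_mask` are hypothetical, and masking the equality flags to the cells before the end marker is an assumption carried over from the software algorithm.

```python
def rank_from_flags(cells, c, p):
    """Behavioural model of the rank computation.

    cells: register contents holding the current BWT (end marker '$' included)
    c:     incoming symbol
    p:     index of '$' within cells
    """
    lt_mask = 0
    eq_mask = 0
    for i, x in enumerate(cells):  # in hardware: one comparator per cell
        if x < c:
            lt_mask |= 1 << i
        elif x == c:
            eq_mask |= 1 << i
    eq_mask &= (1 << p) - 1        # keep only "=" flags before the end marker
    # Two population counts give the insertion rank.
    return bin(lt_mask).count("1") + bin(eq_mask).count("1")
```

With `cells = list("an$a")`, `c = "n"` and `p = 2`, three cells raise the "less than" flag and one cell before the end marker raises the "equal" flag, so the function returns 4. Because both counts are taken over all cells at once, the work per character does not depend on the data, which is consistent with the fixed-latency behaviour described above.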

Reported throughputs for FPGAs and ASICs demonstrate practical feasibility:

FPGA: Xilinx VU9P (no BRAM); 66 MB/s for 128-byte blocks at 345 MHz.
ASIC: 65 nm CMOS; 161 MB/s for 128-byte blocks at 843 MHz (Stangherlin et al., 2022).

The block buffer contains only the working string, with all updates performed without supplementary RAM or output arrays.

5. Correctness, Inductive Invariants, and Suffix Structures

After iteration $s$, the buffer $T[s..n-1]$ encodes both the BWT of the suffix $T[s..n-1]$ and, optionally, the ranks $\mathrm{ISA}[s..n-1]$. The rank-update lemma $\mathrm{ISA}[j] \leftarrow \mathrm{ISA}[j] + 1$ for $j > s$ and $\mathrm{ISA}[j] \ge r$ guarantees consistent ordering. For LCP or Lyndon array computation, additional scans respect the sorted neighbor relations through local rank properties and LF-mapping, with $O(1)$ auxiliary variables at each step.

The algorithms retain correctness under unbounded alphabets, as all decisions are based only on local comparisons and position updates, not enumeration or counting across the alphabet $\Sigma$.

6. Space-Time Tradeoffs and Compression Techniques

If additional words of workspace are permitted, suffixes can be inserted in batches of size $k$ rather than one at a time, reducing the number of scans in exchange for extra space; the resulting time and space bounds depend on $k$ and on the maximum alphabet size in any window of length $k$ (Louza et al., 2016).

For LCP array compression, Elias δ-coding stores each LCP value as a variable-length codeword whose length grows logarithmically with the value; codewords are shifted and inserted in-place using word-level memmoves and scan-to-decode operations.
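Elias δ-coding itself is standard and can be sketched briefly. The functions below use bit strings for readability instead of packed machine words, and the names `elias_delta_encode` and `elias_delta_decode` are hypothetical. Since δ-codes cover positive integers only, the usage example stores each LCP value as `value + 1`; that offset is an assumption of this sketch, not a detail taken from the cited work.

```python
def elias_delta_encode(x):
    """Return the Elias delta codeword of an integer x >= 1 as a bit string."""
    if x < 1:
        raise ValueError("Elias delta codes positive integers only")
    n_bits = x.bit_length()             # N + 1, where N = floor(log2 x)
    len_bits = n_bits.bit_length()      # L + 1, where L = floor(log2(N + 1))
    prefix = "0" * (len_bits - 1)       # L zeros announce the field width
    length_field = format(n_bits, "b")  # N + 1 written in L + 1 bits
    payload = format(x, "b")[1:]        # x without its leading 1 bit
    return prefix + length_field + payload


def elias_delta_decode(bits, pos=0):
    """Decode one codeword starting at bits[pos]; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    pos += zeros
    n_bits = int(bits[pos:pos + zeros + 1], 2)
    pos += zeros + 1
    value = int("1" + bits[pos:pos + n_bits - 1], 2)
    return value, pos + n_bits - 1


# Usage: encode LCP values shifted by one, then scan to decode them again.
lcp = [0, 1, 3, 0, 0, 2]
stream = "".join(elias_delta_encode(v + 1) for v in lcp)
decoded, pos = [], 0
while pos < len(stream):
    v, pos = elias_delta_decode(stream, pos)
    decoded.append(v - 1)
```

A value $x$ takes $\lfloor \log_2 x \rfloor + 2\lfloor \log_2(\lfloor \log_2 x \rfloor + 1) \rfloor + 1$ bits, so small LCP values, which are common in many texts, cost only a few bits each. Reaching the $i$-th value requires decoding from the start of the stream, which corresponds to the scan-to-decode operations mentioned above.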

The constant-space in-place paradigm thus serves as a quadratic-time baseline for maintaining all suffix-derived structures in the buffer; improving on it appears to require nontrivial sampling or parallel bit-vector techniques, for instance on highly repetitive texts.

7. Applications, Extensions, and Limitations

In-place BWT algorithms support a range of theoretical and practical applications:

Suffix and Lyndon array construction for indexing and pattern matching.
Compression schemes such as bzip2 (as realized in hardware).
Conversions between classical and bijective transforms for invertibility and unique factorization.

A plausible implication is that, despite the quadratic running time, the method's conceptual simplicity and generality (it applies to unbounded alphabets and needs no workspace beyond a constant number of variables) make the in-place BWT useful in constrained-memory environments and as a reference procedure for time-space tradeoff analyses.

The in-place BWT method is not intended for high throughput in large-scale settings but rather as a model demonstrating the feasibility of full suffix/transform computation under the strongest workspace restriction (Louza et al., 24 Dec 2025).

Table: In-Place BWT Algorithm Features and Extensions

Extension	Description	Reference
Lyndon array	Next-smaller-value scan from ISA	(Louza et al., 24 Dec 2025)
LCP array	2 scans per suffix insertion	(Louza et al., 2016)
Elias-coded LCP	In-place δ-encoding during shifts	(Louza et al., 2016)
BBWT construction	Factor-by-factor via Duval algorithm	(Köppl et al., 2020)
Hardware pipeline	6-cycle scanchain, parallel popcount	(Stangherlin et al., 2022)

All the above extensions preserve the fundamental invariants and operate within constant extra space, illustrating the power of this paradigm across suffix-related array computation and transform variants.