BWT construction and search at the terabase scale

Published 1 Sep 2024 in q-bio.GN | (2409.00613v2)

Abstract: Motivation: Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices. Results: We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 indexed 320 assembled human genomes in 65 hours and indexed 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using up to 170 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale. Availability and implementation: https://github.com/lh3/ropebwt3

Summary

The paper introduces ropebwt3, a tool for terabase-scale BWT construction that builds partial BWTs for segments of the input and merges them using a B+-tree, with $O(n \log r)$ complexity.
Ropebwt3 indexed 320 assembled human genomes in 65 hours using up to 170 GB of memory at the peak, which makes large-scale pangenomic analysis practical.
Beyond exact matches, the tool supports maximal exact matches (MEMs) and inexact alignments under affine-gap penalties.

Analysis of Terabase-Scale Burrows-Wheeler Transform (BWT) Construction and Query

The paper "BWT construction and search at the terabase scale" describes the design and implementation of ropebwt3, a tool for efficiently constructing and querying the Burrows-Wheeler Transform (BWT) of massive genomic datasets. The BWT is a reversible (lossless) transformation of a text, initially developed for data compression, and it is particularly effective at encoding redundant sequences such as pangenome data. However, BWT construction is resource-intensive and hard to parallelize, which has limited its use in large-scale genomic research.
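To make the transform concrete, the sketch below builds the BWT of a short string by sorting its rotations and then groups the result into runs of identical symbols. It is a naive illustration only, not how ropebwt3 or libsais constructs a BWT; the function names `naive_bwt` and `run_lengths` and the `$` sentinel are assumptions made for this example.

```python
from typing import List, Tuple


def naive_bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text + sentinel by sorting all rotations.

    Quadratic in time and memory; suitable for toy inputs only.
    The sentinel is assumed to sort before every other symbol.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def run_lengths(s: str) -> List[Tuple[str, int]]:
    """Run-length encode s as a list of (symbol, run length) pairs."""
    runs: List[Tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    for text in ("banana", "ACGT" * 4):
        transformed = naive_bwt(text)
        runs = run_lengths(transformed)
        print(text, transformed, f"n={len(transformed)}", f"r={len(runs)}")
```

For `"banana"` the transform is `annb$aa`. For `"ACGT"` repeated four times it is `TTTT$AAAACCCCGGGG`, so 17 symbols collapse into 5 runs. This gap between the number of symbols $n$ and the number of runs $r$ is what makes the BWT attractive for redundant collections such as pangenomes.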

Key Contributions

The paper's central contributions can be summarized as follows:

Algorithmic Enhancement: Ropebwt3 constructs the BWT by breaking a large genomic dataset into manageable segments. It uses libsais to build a partial multi-string BWT for each segment, and the partial BWTs are subsequently merged using a B+-tree built on run-length encoding. The stated complexity of this approach is $O(n \log r)$, where $n$ is the total number of symbols and $r$ is the number of runs in the BWT.
Scalability and Performance: Ropebwt3 indexed 320 assembled human genomes in 65 hours and 7.3 terabases of commonly studied bacterial assemblies in 26 days, using up to 170 GB of memory at the peak. These benchmarks demonstrate that full-text indexing is feasible at the terabase scale, which matters for large pangenomic datasets.
Memory Efficiency: The tool is designed to run with lower memory demands than its predecessors, and the reported runs used no working disk space. This eases its use on systems without large scratch storage.
Enhanced Query Capability: Beyond exact matches, ropebwt3 can find maximal exact matches (MEMs) and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. This broadens its applicability to genomic analyses such as haplotype diversity estimation.
Suitability for Pangenomic Applications: Ropebwt3 can construct the BWT incrementally without staging the complete dataset, which suits pangenomes because they need regular updates as new sequences become available.
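The query capabilities listed above all build on backward search over the BWT. The sketch below shows that building block on a toy index: it counts exact occurrences of a pattern, then uses the counts to report query intervals that match the text and cannot be extended in either direction along the query. It is a simplified illustration under stated assumptions, not ropebwt3's algorithm or data layout: the class `ToyFMIndex` and the function `maximal_exact_matches` are hypothetical names, the index stores uncompressed rank tables rather than a run-length encoded B+-tree, and the match finder is a quadratic brute force.

```python
from typing import Dict, List, Tuple


class ToyFMIndex:
    """Uncompressed FM-index over one string, for illustration only."""

    def __init__(self, text: str, sentinel: str = "$") -> None:
        s = text + sentinel
        # Naive BWT: last column of the sorted rotations (toy inputs only).
        self.bwt = "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))
        self.n = len(self.bwt)
        alphabet = sorted(set(self.bwt))
        # first[c] = number of symbols in the text that sort before c.
        self.first: Dict[str, int] = {}
        total = 0
        for c in alphabet:
            self.first[c] = total
            total += self.bwt.count(c)
        # occ[c][i] = number of occurrences of c in bwt[:i].
        self.occ: Dict[str, List[int]] = {c: [0] for c in alphabet}
        for ch in self.bwt:
            for c in alphabet:
                self.occ[c].append(self.occ[c][-1] + (1 if c == ch else 0))

    def count(self, pattern: str) -> int:
        """Count occurrences of a non-empty pattern by backward search."""
        lo, hi = 0, self.n
        for ch in reversed(pattern):
            if ch not in self.first:
                return 0
            lo = self.first[ch] + self.occ[ch][lo]
            hi = self.first[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo


def maximal_exact_matches(index: ToyFMIndex, query: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open query intervals (start, end) that occur in the text
    and cannot be extended left or right along the query."""
    m = len(query)
    starts = []
    for end in range(1, m + 1):
        start = end
        # Extend leftward while the longer substring still occurs.
        while start > 0 and index.count(query[start - 1:end]) > 0:
            start -= 1
        starts.append(start)
    matches = []
    for i, start in enumerate(starts):
        end = i + 1
        if end - start < max(min_len, 1):
            continue
        # Right-maximal if the longest match ending one position later
        # has to start further right (or the query ends here).
        if end == m or starts[i + 1] > start:
            matches.append((start, end))
    return matches


if __name__ == "__main__":
    idx = ToyFMIndex("banana")
    print(idx.count("ana"))
    print(maximal_exact_matches(idx, "xanab"))
```

With the text `"banana"`, `count("ana")` is 2, and the query `"xanab"` yields the intervals `(1, 4)` and `(4, 5)`, that is, `"ana"` and `"b"`. A production index would replace the rank tables with a compressed structure and find the matches without re-searching each substring.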

Implications and Future Directions

From a theoretical standpoint, ropebwt3 represents progress on the problem of indexing highly redundant, very large datasets, and it builds on the small memory footprint that BWT-based data structures achieve on such data. Practically, this could support adoption in areas such as personalized medicine and epidemiological research, where rapid and scalable data processing is crucial.

Looking forward, some aspects merit further exploration. The incorporation of subsampled r-index structures, which provide a balance between memory usage and query speed, could enhance ropebwt3's applicability to even larger databases. Additionally, expanding support for more diverse query types would strengthen its role in comprehensive genomic analyses, especially in integrating with graph-based approaches currently dominant in the field.

In summary, this paper sets a benchmark for BWT construction and query at a scale that matches modern genomic datasets, and ropebwt3 offers a practical tool for a wide range of genomic analyses. Its methods lay the groundwork for further work in computational genomics.