Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism

Published 23 Oct 2024 in q-bio.GN | (2410.17801v1)

Abstract: Raw nanopore signal analysis is a common approach in genomics to provide fast and resource-efficient analysis without translating the signals to bases (i.e., without basecalling). However, existing solutions cannot interpret raw signals directly if a reference genome is unknown due to a lack of accurate mechanisms to handle increased noise in pairwise raw signal comparison. Our goal is to enable the direct analysis of raw signals without a reference genome. To this end, we propose Rawsamble, the first mechanism that can 1) identify regions of similarity between all raw signal pairs, known as all-vs-all overlapping, using a hash-based search mechanism and 2) use these to construct genomes from scratch, called de novo assembly. Our extensive evaluations across multiple genomes of varying sizes show that Rawsamble provides a significant speedup (on average by 16.36x and up to 41.59x) and reduces peak memory usage (on average by 11.73x and up to by 41.99x) compared to a conventional genome assembly pipeline using the state-of-the-art tools for basecalling (Dorado's fastest mode) and overlapping (minimap2) on a CPU. We find that 36.57% of overlapping pairs generated by Rawsamble are identical to those generated by minimap2. Using the overlaps from Rawsamble, we construct the first de novo assemblies directly from raw signals without basecalling. We show that we can construct contiguous assembly segments (unitigs) up to 2.7 million bases in length (half the genome length of E. coli). We identify previously unexplored directions that can be enabled by finding overlaps and constructing de novo assemblies. Rawsamble is available at https://github.com/CMU-SAFARI/RawHash. We also provide the scripts to fully reproduce our results on our GitHub page.


Summary

The paper presents Rawsamble, a novel method that enables de novo genome assembly directly from raw nanopore signals by using a hash-based mechanism to identify overlaps, bypassing traditional basecalling.
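Rawsamble demonstrates substantial performance improvements over a CPU pipeline of basecalling (Dorado's fastest mode) followed by overlapping (minimap2), achieving on average a 16.36x speedup (up to 41.59x) and an 11.73x reduction in peak memory usage (up to 41.99x).
The overlaps it produces partly agree with those of the basecalled pipeline (36.57% of Rawsamble's overlapping pairs are identical to minimap2's) and are sufficient to assemble unitigs of up to 2.7 million bases, pointing toward more streamlined and efficient genome analysis workflows.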

Overview of "Rawsamble: Overlapping and Assembling Raw Nanopore Signals"

This paper presents "Rawsamble," a method for analyzing raw nanopore sequencing signals directly, without basecalling, which is traditionally considered a prerequisite for genome analysis. Nanopore sequencing offers multiple advantages, including high throughput and the ability to sequence long DNA molecules. However, raw signals are inherently noisy, and the noise is amplified when two raw signals are compared with each other rather than with a clean reference, which is why existing raw-signal tools depend on a known reference genome. Rawsamble addresses this by using a hash-based search mechanism to identify regions of similarity between all pairs of raw signals (all-vs-all overlapping); these overlaps are then used for de novo genome assembly.
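To make the idea of hash-based all-vs-all overlapping concrete, the following is a minimal sketch and not Rawsamble's implementation. It assumes the signals are already normalized, quantizes each value into a small number of buckets so that small fluctuations map to the same bucket, and uses windows of consecutive buckets as seeds in a hash table. The function names (`quantize`, `seed_keys`, `find_overlaps`) and the parameters (`n_levels`, `k`, `min_hits`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def quantize(signal, n_levels=8, lo=-3.0, hi=3.0):
    """Map each (assumed normalized) signal value to an integer bucket."""
    width = (hi - lo) / n_levels
    levels = []
    for x in signal:
        clipped = min(max(x, lo), hi - 1e-9)  # keep values inside [lo, hi)
        levels.append(int((clipped - lo) / width))
    return levels


def seed_keys(levels, k=4):
    """Yield every window of k consecutive buckets as a hashable seed."""
    for i in range(len(levels) - k + 1):
        yield tuple(levels[i:i + k])


def find_overlaps(reads, k=4, min_hits=3):
    """Return {(read_a, read_b): shared_seed_count} for candidate overlaps.

    Illustrative only: a real tool would also use seed positions to check
    that the shared seeds are colinear before reporting an overlap.
    """
    index = defaultdict(set)  # seed -> names of reads containing it
    for name, signal in reads.items():
        for key in seed_keys(quantize(signal), k):
            index[key].add(name)

    counts = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_hits}


if __name__ == "__main__":
    shared = [0.1, 1.2, -0.8, 2.0, -1.5, 0.6, 1.9, -2.2]
    reads = {
        "r1": [-2.5, 0.9] + shared,   # ends with the shared segment
        "r2": shared + [2.5, -0.1],   # starts with the shared segment
        "r3": [0.0] * 10,             # unrelated flat signal
    }
    print(find_overlaps(reads))
```

In this toy input, `r1` and `r2` share a segment and are reported as a candidate overlap, while `r3` shares no seed with either. Building one index over all reads and only comparing reads that collide in it is what avoids an explicit comparison of every pair of signals.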

Novel Contributions

Rawsamble introduces several key innovations:

Hash-based All-vs-All Overlapping: Employing a hash-based search technique, Rawsamble efficiently identifies similar regions across all raw nanopore signal pairs within a dataset. This functionality positions Rawsamble as the first method enabling de novo genome assembly directly from raw signals.
Performance Enhancements: The method is substantially faster and more memory-efficient than a conventional CPU pipeline that basecalls the reads (Dorado's fastest mode) and then finds overlaps (minimap2). On average, Rawsamble achieves a 16.36x speedup in elapsed time (up to 41.59x) and reduces peak memory usage by 11.73x (up to 41.99x).
Overlap Agreement with Basecalled Reads: In the evaluations, 36.57% of the overlapping pairs generated by Rawsamble are identical to those generated by minimap2 on basecalled sequences. The agreement is partial, but it shows that useful overlaps can be recovered while operating purely at the signal level.
De novo Assembly Capability: Using its own overlaps, Rawsamble constructs the first de novo assemblies directly from raw signals, with contiguous assembly segments (unitigs) of up to 2.7 million bases, about half the genome length of E. coli. This is a significant step toward streamlined genome assembly processes.
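The overlap-agreement figure above can be read as a simple set computation: of all read pairs one tool reports as overlapping, what fraction does the baseline also report? The sketch below shows that computation under the assumption that two overlaps are "identical" when they involve the same pair of reads, regardless of order; the paper's exact matching criterion is not given in this summary, and the function name `shared_pair_fraction` is hypothetical.

```python
def shared_pair_fraction(candidate_pairs, baseline_pairs):
    """Fraction of candidate read pairs that the baseline also reports.

    Pairs are treated as unordered, so ("a", "b") equals ("b", "a").
    This matching rule is an assumption made for illustration.
    """
    def normalize(pairs):
        return {tuple(sorted(pair)) for pair in pairs}

    candidate = normalize(candidate_pairs)
    baseline = normalize(baseline_pairs)
    if not candidate:
        return 0.0
    return len(candidate & baseline) / len(candidate)


if __name__ == "__main__":
    # Made-up read names, only to exercise the function.
    signal_level = [("a", "b"), ("c", "b"), ("a", "d")]
    basecalled = [("b", "a"), ("d", "e")]
    print(f"{shared_pair_fraction(signal_level, basecalled):.2%}")
```

The denominator here is the candidate tool's own set of pairs, matching the phrasing "of overlapping pairs generated by Rawsamble"; dividing by the baseline's pairs instead would measure something different (how much of the baseline is recovered).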

Implications and Future Directions

Rawsamble's ability to bypass basecalling introduces potential for accelerating genome analysis workflows by simplifying downstream data processing and enhancing real-time analysis capabilities. Removing the basecalling step could reduce computational costs and facilitate analysis in resource-constrained environments, such as field-based sequencing using portable devices.

Long-term, Rawsamble could serve as a foundational methodology for improving the accuracy and efficiency of genome assemblies and facilitating novel applications in genomics. Future work may explore further reducing noise impact through enhanced filtering techniques and optimizing overlap discovery for broader genomic contexts. Integrating Rawsamble-derived overlaps into existing basecalling platforms might yield more accurate nucleotide sequences by leveraging contextual overlap data during sequence mapping or variant calling.

Additionally, advancements in nanopore technology hardware might further improve the efficacy of techniques like Rawsamble by reducing raw signal noise, an area ripe for cross-disciplinary research between hardware and bioinformatics experts.

Conclusion

Rawsamble represents a significant step in handling raw nanopore sequencing data: it removes basecalling from overlapping and assembly while still producing usable overlaps and unitigs. By providing a fast, memory-efficient, direct signal-to-assembly pipeline, Rawsamble opens new avenues for genetic analysis, with the prospect of streamlined workflows and broader applicability in genomic research and clinical applications. The methodological insights and quantitative results outlined in this paper will likely influence future research directions in the field of high-throughput DNA sequencing technologies.