SynDe: Syndrome-guided Decoding of Raw Nanopore Reads

Published 1 Apr 2026 in cs.IT | (2604.01054v1)

Abstract: Nanopore sequencing technology remains highly error-prone, making efficient error correction essential in DNA-based data storage. Prior work addressed high error rates using convolutional codes with their decoder coupled with the basecaller, but such approaches only accommodate a limited number of code classes and incur significant decoding complexity. To overcome these limitations, we propose two algorithms: PrimerSeeker, which efficiently detects primer sequences in raw nanopore sequencing reads, and SynDe, a decoder that operates on the same raw reads and supports any linear error correction code with a low-complexity graphical representation. PrimerSeeker provides primer location estimates close to those of existing approaches while being better suited for real-time primer detection during sequencing. SynDe performs well with convolutional codes augmented with periodic markers, often approaching or exceeding the performance of existing algorithms with a lower time complexity. Remarkably, the confidence scores produced by SynDe reliably identify which of its outputs should be discarded.

Summary

  • The paper introduces SynDe, a syndrome-guided sequential decoder for raw nanopore reads that matches or exceeds the frame error rates of prior decoders at lower computational cost.
  • The paper presents PrimerSeeker, a primer localization method for payload boundary detection that uses k-mer dwell-time features and reaches over 97% agreement (within 50 samples, about 5 nt) with basecalling-based localization.
  • The framework supports diverse linear error-correcting codes with linear-time complexity, enabling real-time decoding for DNA storage applications.

Syndrome-Guided Decoding of Raw Nanopore Reads: The SynDe Framework

Introduction

The persistently high error rates of nanopore sequencing (insertions, deletions, and substitutions) are a major challenge for DNA-based data storage. Conventional decoders for such noisy channels often rely on a consensus across multiple reads, which lowers physical information density and computational efficiency. More recent approaches integrate convolutional code decoding with basecalling neural networks, but they support only a limited set of code classes and incur prohibitive computational complexity. This paper introduces two algorithms, PrimerSeeker and SynDe, that together enable efficient, syndrome-guided end-to-end decoding of raw nanopore reads.

Syndrome-Guided Sequential Decoding

At the heart of the proposed solution is SynDe, a sequential decoder that replaces the code-constrained basecalling paradigm with a syndrome-trellis-based beam search. Unlike prior basecaller-decoder integration schemes that operate directly on the full state trellis of convolutional codes and consequently scale exponentially in the memory parameter $\nu$, SynDe leverages the syndrome trellis representation to guarantee time complexity linear in read length and independent of code memory.

The syndrome-trellis framework allows maximal flexibility for the choice of underlying codes. Any linear error-correcting code admitting a compact syndrome trellis is supported, including high-rate and high-memory convolutional codes, concatenations with marker codes for synchronization, and straightforward extensions to more general codes. This addresses the critical shortcoming of existing approaches such as BeamTrellis [chandakOvercomingHighNanopore2020] and AlignmentMatrix [volkel_nanopore_2025], which either restrict supported code classes or scale exponentially in key parameters.
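To make the idea of a beam search restricted to codeword-consistent paths concrete, the following is a minimal sketch under strong simplifying assumptions: a binary channel with independent per-bit probabilities, no insertions or deletions, no CTC or HMM layer, and a (7,4) Hamming code standing in for the convolutional codes discussed here. The names `H_COLS`, `reachable_states`, and `beam_decode` are hypothetical and do not come from the paper. The trellis state is the partial syndrome, and a branch is kept only if the zero syndrome is still reachable from it.

```python
import math

# Assumption: columns of a parity-check matrix for the (7,4) Hamming code,
# each column packed into a 3-bit integer. This is only a stand-in code.
H_COLS = [0b001, 0b010, 0b011, 0b100, 0b101, 0b110, 0b111]


def reachable_states(h_cols):
    """reach[t] holds the partial syndromes at depth t that can still end at 0."""
    n = len(h_cols)
    reach = [set() for _ in range(n + 1)]
    reach[n] = {0}
    for t in range(n - 1, -1, -1):
        for s in reach[t + 1]:
            reach[t].add(s)              # bit t = 0 leaves the syndrome unchanged
            reach[t].add(s ^ h_cols[t])  # bit t = 1 adds column t to the syndrome
    return reach


def beam_decode(p_one, h_cols, width):
    """Beam search over the syndrome trellis.

    p_one[t] is the (assumed independent) probability that bit t equals 1.
    Returns (bits, log_probability) of the best surviving codeword.
    """
    reach = reachable_states(h_cols)
    beams = [(0.0, 0, [])]  # (log-probability, partial syndrome, bits so far)
    for t, p in enumerate(p_one):
        candidates = []
        for lp, s, bits in beams:
            for b in (0, 1):
                pb = max(p if b else 1.0 - p, 1e-12)  # avoid log(0)
                s_next = s ^ h_cols[t] if b else s
                if s_next in reach[t + 1]:  # keep codeword-consistent branches only
                    candidates.append((lp + math.log(pb), s_next, bits + [b]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:width]
    best_lp, _, best_bits = beams[0]
    return best_bits, best_lp


# Bit 2 is weakly in error: a hard decision would give 1100000, not a codeword.
bits, log_prob = beam_decode([0.9, 0.9, 0.4, 0.1, 0.1, 0.1, 0.1], H_COLS, width=4)
print(bits, log_prob)
```

Each step does work proportional to the beam width, so the total cost in this toy setting grows as the beam width times the sequence length, which mirrors the linear scaling described above.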

The SynDe algorithm operates directly on the probability matrices output by current state-of-the-art neural basecallers (e.g., Bonito, Lokatt), performing beam search exclusively over codeword-consistent paths in the syndrome trellis. Frame Error Rate (FER) benchmarks on experimental datasets show that, at equivalent discard rates, SynDe consistently matches or exceeds the accuracy of previous approaches at lower computational cost. For example, SynDe-CTC achieves 0.5% FER compared to 1.5% for BeamTrellis at similar code rates (0.7 vs. 0.68) and at 65% read discard, a threefold reduction in FER (2604.01054).

Primer Sequence Localization with PrimerSeeker

A further bottleneck in basecaller-integrated decoders is reliable localization of payload boundaries, specifically the detection of flanking primers, from the raw current signal. Traditional pipelines perform two complete basecalling passes: a first pass that basecalls the read, followed by an edit-distance search for probable primer alignments, and a second, code-constrained pass for data recovery. This is inefficient and delays feedback.

PrimerSeeker directly addresses this limitation by executing a constrained beam search for primer sequences within the neural probability matrix, removing the need for an initial basecalling pass. The algorithm exploits the statistical dwell-time features of k-mers in raw nanopore signals, propagating beams for candidate primer positions, marginalizing over stochastic transition paths, and dynamically pruning redundant computations. Both CTC-based and explicit-duration HMM (Lokatt) variants are provided. On experimental datasets, PrimerSeeker's location estimates agree with those of full-signal basecalling approaches in more than 97% of cases (within 50 samples, about 5 nt), with mean computational complexity close to that of standard basecallers (e.g., a mean beam complexity of 33.28 for PrimerSeeker-CTC vs. 28.9 for 5-beam Bonito basecalling) (2604.01054). This enables low-latency, on-the-fly decoding workflows.
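As a rough illustration of scoring a known primer directly on a CTC probability matrix, the sketch below uses the standard CTC forward recursion and a brute-force sliding window. This is not PrimerSeeker's beam search: the fixed window length, the exhaustive scan, and the names `ctc_forward` and `locate_primer` are assumptions made for this example, and dwell-time statistics are ignored entirely.

```python
def ctc_forward(probs, target, blank=0):
    """Probability that the CTC frames in `probs` collapse to `target`.

    probs[t][k] is the probability of symbol k at frame t (k == blank is the
    CTC blank). Standard forward recursion over the blank-extended target.
    """
    ext = [blank]
    for y in target:
        ext += [y, blank]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same extended label
            if s >= 1:
                total += alpha[s - 1]             # advance by one extended label
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip a blank between distinct labels
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def locate_primer(probs, target, window, blank=0):
    """Start frame of the length-`window` slice that best explains `target`."""
    if len(probs) < window:
        raise ValueError("probability matrix is shorter than the window")
    starts = range(len(probs) - window + 1)
    return max(starts, key=lambda i: ctc_forward(probs[i:i + window], target, blank))


# Toy matrix: symbols 0=blank, 1=A, 2=C, 3=G, 4=T; each frame favours one symbol.
dominant = [4, 0, 4, 4, 0, 1, 2, 3, 0, 4]
frames = [[0.96 if k == d else 0.01 for k in range(5)] for d in dominant]
print(locate_primer(frames, [1, 2, 3], window=3))
```

The exhaustive scan costs one forward pass per window, which is exactly the kind of redundant computation that a beam search with pruning is meant to avoid. For long reads the recursion would also need to run in the log domain to avoid underflow.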

Coding Schemes and Marker Integration

The syndrome-trellis-centric SynDe framework supports arbitrary linear coding schemes but is evaluated mainly in the convolutional/marker code regime. Marker codes insert known symbols at periodic positions to mitigate misalignment caused by dwell-time uncertainty and indel propagation. Empirically, the joint use of high-memory convolutional and marker codes with SynDe achieves better discard/FER tradeoffs than the state of the art, without the rate penalty or complexity blowup observed with approaches requiring puncturing or CRC augmentation of convolutional codes (2604.01054).
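Periodic marker insertion can be sketched in a few lines. The marker string `"TT"`, the period of 4, and the function names below are hypothetical choices for illustration only, and `remove_markers` assumes an error-free sequence; in a real decoder the markers are used during decoding to resynchronize, not simply stripped by position.

```python
def insert_markers(payload, marker, period):
    """Insert `marker` between consecutive blocks of `period` payload symbols."""
    blocks = [payload[i:i + period] for i in range(0, len(payload), period)]
    return marker.join(blocks)


def remove_markers(coded, marker_len, period):
    """Inverse of insert_markers, assuming no insertions or deletions occurred."""
    step = period + marker_len
    return "".join(coded[i:i + period] for i in range(0, len(coded), step))


def marker_rate(payload_len, marker_len, period):
    """Fraction of transmitted symbols that carry payload."""
    n_markers = max((payload_len - 1) // period, 0)
    return payload_len / (payload_len + n_markers * marker_len)


coded = insert_markers("ACGTACGTAC", "TT", 4)
print(coded, remove_markers(coded, 2, 4), marker_rate(10, 2, 4))
```

The rate function makes the cost of synchronization explicit: shorter periods give the decoder more anchor points but spend more of the strand on symbols that carry no data.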

Additionally, unlike Viterbi-based approaches, the sequential beam search decoder yields output likelihood scores that are robust predictors of decoding reliability, allowing flexible acceptance thresholds in place of static CRC checks. The threshold can be tuned to an application-specific trade-off between throughput and the error rate that can be guaranteed.
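The trade-off between discard rate and FER can be explored with a small helper like the one below. It is a sketch that assumes each decoded read comes with a scalar confidence score (higher meaning more reliable) and a ground-truth correctness flag from a labelled evaluation set; the function name and the toy numbers are hypothetical and are not results from the paper.

```python
def fer_at_discard(scores, correct, discard_fraction):
    """Frame error rate among the reads kept after discarding the lowest scores.

    scores[i]  : decoder confidence for read i (higher is more reliable)
    correct[i] : True if read i was decoded without error
    Returns (fer, threshold), where threshold is the lowest score still kept.
    """
    n = len(scores)
    n_keep = n - int(discard_fraction * n)
    if n_keep <= 0:
        raise ValueError("discard_fraction leaves no reads")
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = order[:n_keep]
    errors = sum(1 for i in kept if not correct[i])
    return errors / n_keep, scores[kept[-1]]


# Toy data: the single wrong read has a low score and is dropped at 50% discard.
scores = [-1.0, -2.0, -5.0, -9.0]
correct = [True, True, False, True]
for fraction in (0.0, 0.5):
    print(fraction, fer_at_discard(scores, correct, fraction))
```

Sweeping `discard_fraction` over a labelled set yields the discard/FER curve, and the returned threshold is the value that would then be applied to unlabelled reads.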

Complexity and Universality

Key to the advances offered by SynDe and PrimerSeeker is the shift from exponentially scaling dynamic programs over the full code trellis towards linear-time sequential decoding in syndrome space. For SynDe, both time and space complexities are $O(WN)$, where $W$ is the beam width (kept modest, e.g. 512) and $N$ is the number of signal samples analyzed, independent of code memory or rate. For BeamTrellis, complexity scales as $O(M^2 N 2^\nu)$, i.e., quadratic in payload and exponential in memory, rendering high-memory or high-rate codes impractical (2604.01054).
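The difference between the two scaling laws is easiest to see numerically. The sketch below evaluates only the leading terms $WN$ and $M^2 N 2^\nu$ with all constant factors dropped, so the outputs are growth indicators and not runtime predictions. Apart from $W = 512$, which is mentioned above, the values chosen for $N$, $M$, and $\nu$ are hypothetical.

```python
def synde_cost(beam_width, n_samples):
    """Leading term W * N of the O(WN) bound (constants ignored)."""
    return beam_width * n_samples


def beamtrellis_cost(m, n_samples, nu):
    """Leading term M^2 * N * 2^nu of the O(M^2 N 2^nu) bound (constants ignored)."""
    return m * m * n_samples * 2 ** nu


# Hypothetical sizes: N = 4000 signal samples, M = 100.
n_samples, m = 4000, 100
for nu in (6, 8, 11, 14):
    ratio = beamtrellis_cost(m, n_samples, nu) / synde_cost(512, n_samples)
    print(f"nu={nu:2d}  leading-term ratio = {ratio:,.0f}")
```

Because the ratio doubles with every unit increase in $\nu$ while the $WN$ term does not change, code memory stops being a constraint on decoding cost in the syndrome-space formulation.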

This universality also extends to the supported error-correcting codes: basecaller integration is possible both for standard convolutional constructions and for any linear code with a compact syndrome trellis, not just for codes amenable to direct encoder-state-based Viterbi search.

Practical and Theoretical Implications

From a practical DNA storage systems perspective, SynDe and PrimerSeeker facilitate real-time, resource-efficient decoding compatible with arbitrary linear error correction strategies at rates and payload lengths inaccessible to previous methods. Their ability to combine diverse code classes, marker strategies, and direct raw signal operation streamlines pipeline integration, enabling on-line or adaptive processing scenarios not otherwise feasible.

Theoretically, this work demonstrates that syndrome-trellis-based algorithms, long studied for channels with synchronization errors [sidorenkoDecodingConvolutionalCodes1994,rosnesMaximumLengthConvolutional2004], can be effectively paired with neural basecallers to exploit the rich "soft" information inherent in modern nanopore sequencing data. Future developments could include variants based on Fano or stack decoding heuristics, iterative soft-input/soft-output decoding, and more elaborate marginalization schemes that integrate outer codes as well as further synchronization constructs.

Notably, the architectural separation of soft basecalling likelihood extraction, online boundary detection, and codeword-constrained sequential search suggests significant extensibility to next-generation nanopore/other long-read sequencing protocols with more complex error models or larger underlying alphabets.

Conclusion

This work introduces a computational framework for raw nanopore read decoding that achieves both generality and efficiency through syndrome-trellis-based sequential decoding (SynDe) and direct primer localization (PrimerSeeker) (2604.01054). Empirical evaluation demonstrates substantial reductions in computational complexity, robust performance across code families and real sequencing datasets, and credible reliability estimates from output likelihoods. Collectively, these techniques significantly advance the practicality of DNA-based data storage systems utilizing noisy, high-throughput nanopore sequencing, narrowing the performance gap to more mature sequencing technologies, and opening avenues for further innovation in code design and neural sequence analysis.
