
Revisiting K-mer Profile for Effective and Scalable Genome Representation Learning

Published 4 Nov 2024 in cs.LG, cs.AI, cs.CE, and q-bio.GN | (2411.02125v1)

Abstract: Obtaining effective representations of DNA sequences is crucial for genome analysis. Metagenomic binning, for instance, relies on genome representations to cluster complex mixtures of DNA fragments from biological samples with the aim of determining their microbial compositions. In this paper, we revisit k-mer-based representations of genomes and provide a theoretical analysis of their use in representation learning. Based on the analysis, we propose a lightweight and scalable model for performing metagenomic binning at the genome read level, relying only on the k-mer compositions of the DNA fragments. We compare the model to recent genome foundation models and demonstrate that, while the models are comparable in performance, the proposed model is significantly more scalable, a crucial aspect for performing metagenomic binning of real-world datasets.
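To make the notion of a "k-mer composition" concrete, here is a minimal Python sketch of one common way to compute a k-mer profile: count every length-k substring of a DNA fragment and normalize the counts into a frequency vector. This illustrates the general idea only and is not the paper's model. The function name `kmer_profile`, the default `k=4`, the lexicographic ordering over `ACGT`, and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from itertools import product


def kmer_profile(sequence, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Hypothetical helper for illustration: entries follow the lexicographic
    order of all 4**k k-mers over the alphabet ACGT.
    """
    sequence = sequence.upper()
    # Enumerate all possible k-mers and map each one to a vector index.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    # Slide a window of length k over the sequence, one base at a time.
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        # Assumption: windows with non-ACGT symbols (e.g. N) are skipped.
        if position is not None:
            counts[position] += 1

    total = sum(counts)
    if total == 0:
        # No valid k-mer found (sequence too short or fully ambiguous).
        return [0.0] * len(kmers)
    return [count / total for count in counts]


if __name__ == "__main__":
    profile = kmer_profile("ACGTAC", k=2)
    print(len(profile))  # number of possible 2-mers
    print(profile[1])    # relative frequency of "AC"
```

Under these assumptions the vector has 4^k entries regardless of fragment length, so fragments of different sizes can be compared or clustered in the same space.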
