SpanSeq: Similarity-based sequence data splitting method for improved development and assessment of deep learning projects
Abstract: The use of deep learning models in computational biology has increased massively in recent years, and is expected to grow further with current advances in fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. To assess their performance on unseen data (their capacity to generalize), it is common to randomly split the available data into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization because of the similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics. The results not only confirm the consequences of randomly splitting databases for model assessment, but also extend those repercussions to model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
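To make the idea of similarity-aware partitioning concrete, the following is a minimal Python sketch. It is not SpanSeq's actual algorithm or API: the k-mer Jaccard similarity, the single-linkage clustering, the greedy balancing step, the threshold of 0.5, and all function names are assumptions chosen for illustration. The only point it shows is that whole groups of similar sequences are assigned to the same partition, so that no pair above the threshold is split between sets.

```python
from itertools import combinations

def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_similarity(seqs, threshold=0.5, k=3):
    """Single-linkage clusters: join sequences whose similarity >= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def split_clusters(clusters, n_parts=2):
    """Greedily assign whole clusters, largest first, to the smallest partition."""
    parts = [[] for _ in range(n_parts)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(parts, key=len).extend(cluster)
    return parts

# Toy data (hypothetical sequences, for illustration only).
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAATG", "GGGGGCCCCC"]
clusters = cluster_by_similarity(seqs, threshold=0.5)
parts = split_clusters(clusters, n_parts=2)
print(parts)
```

The all-against-all comparison here is quadratic in the number of sequences, so this sketch only suits small datasets. A random split of the same toy data could place the first two sequences in different sets even though they share most of their k-mers, which is the leakage the abstract describes.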