Exploiting New Properties of String Net Frequency for Efficient Computation

Published 19 Apr 2024 in cs.DS | (2404.12701v2)

Abstract: Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic to solve two key problems related to net frequency. First, \textsc{single-nf}: how to compute the net frequency of a given string of length $m$ in an input text of length $n$ over an alphabet of size $\sigma$. Second, \textsc{all-nf}: given a length-$n$ input text, how to report every string of positive net frequency. Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and a solution to the coloured range listing problem. We show that, for both problems, our data structure has $O(n)$ construction cost; with this structure, we solve \textsc{single-nf} in $O(m + \sigma)$ time and \textsc{all-nf} in $O(n)$ time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for \textsc{single-nf}. For \textsc{all-nf}, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method.
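
The abstract's remark that raw frequency favours short strings can be illustrated with a small brute-force sketch. This is not the paper's method and it does not compute net frequency. It only counts plain occurrence frequency, overlaps included, for every substring of a toy text. The function names and the example text `abracadabra` are assumptions chosen for illustration.

```python
from collections import Counter


def substring_frequencies(text):
    """Count occurrences (overlaps included) of every non-empty substring.

    Brute force, quadratically many substrings: for illustration only.
    """
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[text[i:j]] += 1
    return counts


def max_frequency_by_length(text):
    """Map each substring length to the highest frequency seen at that length."""
    best = {}
    for s, c in substring_frequencies(text).items():
        best[len(s)] = max(best.get(len(s), 0), c)
    return best


# Hypothetical toy input, not taken from the paper.
text = "abracadabra"
freq = substring_frequencies(text)
best = max_frequency_by_length(text)

for length in sorted(best):
    print(length, best[length])
```

Every occurrence of a string is also an occurrence of each of its prefixes, so the highest frequency at each length can never rise as the length grows. Ranking by frequency alone therefore pushes single symbols to the top, which is the motivation the abstract gives for considering net frequency instead.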

