
Intermediate N-Gramming: Deterministic and Fast N-Grams For Large N and Large Datasets

Published 18 Nov 2025 in cs.DS | (2511.14955v1)

Abstract: The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask: is it possible to accurately, deterministically, and quickly recover the top-k most frequent n-grams? We devise a multi-pass algorithm called Intergrams that constructs candidate n-grams from the preceding (n - 1)-grams. By designing this algorithm with hardware in mind, our approach yields more than an order of magnitude speedup (up to 33x!) over the next known fastest algorithm, even when similar optimizations are applied to the other algorithm. Using the empirical power-law distribution over n-grams, we also provide theory to inform the efficacy of our multi-pass approach. Our code is available at https://github.com/rcurtin/Intergrams.
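The multi-pass idea in the abstract, building candidate n-grams from the preceding (n - 1)-grams, can be illustrated with a minimal sketch. This is not the Intergrams implementation from the linked repository; it is an Apriori-style illustration under stated assumptions. The function names `count_candidates` and `top_k_ngrams`, the `keep` parameter, and the use of token lists are all hypothetical choices made for this example. It relies on one certain fact: an n-gram can occur no more often than its (n - 1)-gram prefix or suffix, so an n-gram whose prefix or suffix was rare can be skipped. Whether the result is exact depends on `keep` being large enough, which the sketch does not verify.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Gram = Tuple[str, ...]


def count_candidates(docs: Sequence[Sequence[str]], n: int, kept: Set[Gram]) -> Counter:
    """One pass over the corpus counting n-grams of length n.

    For n > 1, an n-gram is counted only if both its (n-1)-gram prefix and
    its (n-1)-gram suffix survived the previous pass (are in `kept`).
    """
    counts: Counter = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            gram = tuple(doc[i:i + n])
            if n == 1 or (gram[:-1] in kept and gram[1:] in kept):
                counts[gram] += 1
    return counts


def top_k_ngrams(docs: Sequence[Sequence[str]], n: int, k: int, keep: int) -> List[Tuple[Gram, int]]:
    """Return the k most frequent n-grams found by n passes over `docs`.

    `keep` (an assumption of this sketch) is how many grams are retained
    after each intermediate pass; too small a value can drop true top-k
    n-grams, so the result is only as accurate as `keep` allows.
    """
    kept: Set[Gram] = set()
    for m in range(1, n + 1):
        counts = count_candidates(docs, m, kept)
        if m == n:
            return counts.most_common(k)
        kept = {gram for gram, _ in counts.most_common(keep)}
    return []


docs = [["a", "b", "c", "a", "b", "c"], ["a", "b", "d"]]
print(top_k_ngrams(docs, n=3, k=1, keep=10))  # [(('a', 'b', 'c'), 2)]
```

The design trade-off is one extra pass over the data per value of n in exchange for a much smaller counting table in each pass, since most candidate n-grams are filtered out before they are ever counted.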
