Papers
Topics
Authors
Recent
Search
2000 character limit reached

Dimension Independent Similarity Computation

Published 11 Jun 2012 in cs.DS, cs.AI, and cs.DC | (1206.2082v4)

Abstract: We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high dimensional sparse vectors. All of our results are provably independent of dimension, meaning apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension, thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similiarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems at large scale using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com.

Authors (2)
Citations (73)

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.