Finding Similar Items - Mining of Massive Datasets

Databases Reference

In-Depth Information

of the string times the maximum Jaccard distance (1 minus the minimum

Jaccard similarity).

F

Position Indexes: We can index strings not only on the characters in

their prefixes, but on the position of that character within the prefix. We

reduce the number of pairs of strings that must be compared, because

if two strings share a character that is not in the first position in both

strings, then we know that either there are some preceding characters that

are in the union but not the intersection, or there is an earlier symbol that

appears in both strings.

F

Su x Indexes: We can also index strings based not only on the characters

in their prefixes and the positions of those characters, but on the length

of the character's su x - the number of positions that follow it in the

string. This structure further reduces the number of pairs that must be

compared, because a common symbol with different su x lengths implies

additional characters that must be in the union but not in the intersection.

3.11

References for Chapter 3

The technique we called shingling is attributed to [10]. The use in the manner

we discussed here is from [2]. Minhashing comes from [3]. The original works

on locality-sensitive hashing were [9] and [7]. [1] is a useful summary of ideas

in this field.

[4] introduces the idea of using random-hyperplanes to summarize items in

a way that respects the cosine distance. [8] suggests that random hyperplanes

plus LSH can be more accurate at detecting similar documents than minhashing

plus LSH.

Techniques for summarizing points in a Euclidean space are covered in [6].

[11] presented the shingling technique based on stop words.

The length and prefix-based indexing schemes for high-similarity matching

comes from [5]. The technique involving su x length is from [12].

1. A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approxi-

mate nearest neighbor in high dimensions,” Comm. ACM 51:1, pp. 117-

122, 2008.

2. A.Z. Broder, “On the resemblance and containment of documents,” Proc.

Compression and Complexity of Sequences, pp. 21-29, Positano Italy,

1997.

3. A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, “Min-wise

independent permutations,” ACM Symposium on Theory of Computing,

pp. 327-336, 1998.

4. M.S. Charikar, “Similarity estimation techniques from rounding algo-

rithms,” ACM Symposium on Theory of Computing, pp. 380-388, 2002.

Search WWH ::

Custom Search

Home