Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

ing of the signs (+1 or −1) of the dot products. The fraction of positions in which the sketches of two vectors agree,

multiplied by 180, is an estimate of the angle between the two vectors.

✦ LSH For Euclidean Distance : A set of basis functions to start LSH for Euclidean distance can be obtained by choos-

ing random lines and projecting points onto those lines. Each line is broken into fixed-length intervals, and the func-

tion answers “yes” to a pair of points that fall into the same interval.

✦ High-Similarity Detection by String Comparison : An alternative approach to finding similar items, when the

threshold of Jaccard similarity is close to 1, avoids using minhashing and LSH. Rather, the universal set is ordered,

and sets are represented by strings, consisting their elements in order. The simplest way to avoid comparing all pairs

of sets or their strings is to note that highly similar sets will have strings of approximately the same length. If we sort

the strings, we can compare each string with only a small number of the immediately following strings.

✦ Character Indexes : If we represent sets by strings, and the similarity threshold is close to 1, we can index all strings

by their first few characters. The prefix whose characters must be indexed is approximately the length of the string

times the maximum Jaccard distance (1 minus the minimum Jaccard similarity).

✦ Position Indexes : We can index strings not only on the characters in their prefixes, but on the position of that charac-

ter within the prefix. We reduce the number of pairs of strings that must be compared, because if two strings share a

character that is not in the first position in both strings, then we know that either there are some preceding characters

that are in the union but not the intersection, or there is an earlier symbol that appears in both strings.

✦ Suffix Indexes : We can also index strings based not only on the characters in their prefixes and the positions of those

characters, but on the length of the character's suffix - the number of positions that follow it in the string. This struc-

ture further reduces the number of pairs that must be compared, because a common symbol with different suffix

lengths implies additional characters that must be in the union but not in the intersection.

3.11 References for Chapter 3

The technique we called shingling is attributed to [ 10 ] . The use in the manner we discussed

here is from [ 2 ]. Minhashing comes from [ 3 ]. The original works on locality-sensitive

hashing were [ 9 ] and [ 7 ] . [ 1 ] is a useful summary of ideas in this field.

[ 4 ] introduces the idea of using random-hyperplanes to summarize items in a way that

respects the cosine distance. [ 8 ] suggests that random hyperplanes plus LSH can be more

accurate at detecting similar documents than minhashing plus LSH.

Techniques for summarizing points in a Euclidean space are covered in [ 6 ] . [ 11 ] presen-

ted the shingling technique based on stop words.

The length and prefix-based indexing schemes for high-similarity matching comes from

[ 5 ] . The technique involving suffix length is from [ 12 ].

[1] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,”

Comm. ACM 51:1, pp. 117-122, 2008.

[2] A.Z. Broder, “On the resemblance and containment of documents,” Proc. Compression and Complexity of

Sequences , pp. 21-29, Positano Italy, 1997.

[3] A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” ACM Sym-

posium on Theory of Computing , pp. 327-336, 1998.

[4] M.S. Charikar, “Similarity estimation techniques from rounding algorithms,” ACM Symposium on Theory of Com-

puting , pp. 380-388, 2002.

Search WWH ::

Custom Search

Home