Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

3.10 Summary of Chapter 3

✦ Jaccard Similarity : The Jaccard similarity of sets is the ratio of the size of the intersection of the sets to the size of

the union. This measure of similarity is suitable for many applications, including textual similarity of documents and

similarity of buying habits of customers.

✦ Shingling : A k -shingle is any k characters that appear consecutively in a document. If we represent a document by its

set of k -shingles, then the Jaccard similarity of the shingle sets measures the textual similarity of documents. Some-

times, it is useful to hash shingles to bit strings of shorter length, and use sets of hash values to represent documents.

✦ Minhashing : A minhash function on sets is based on a permutation of the universal set. Given any such permutation,

the minhash value for a set is that element of the set that appears first in the permuted order.

✦ Minhash Signatures : We may represent sets by picking some list of permutations and computing for each set its

minhash signature, which is the sequence of minhash values obtained by applying each permutation on the list to

that set. Given two sets, the expected fraction of the permutations that will yield the same minhash value is exactly

the Jaccard similarity of the sets.

✦ Efficient Minhashing : Since it is not really possible to generate random permutations, it is normal to simulate a per-

mutation by picking a random hash function and taking the minhash value for a set to be the least hash value of any

of the set's members.

✦ Locality-Sensitive Hashing for Signatures : This technique allows us to avoid computing the similarity of every pair

of sets or their minhash signatures. If we are given signatures for the sets, we may divide them into bands, and only

measure the similarity of a pair of sets if they are identical in at least one band. By choosing the size of bands appro-

priately, we can eliminate from consideration most of the pairs that do not meet our threshold of similarity.

✦ Distance Measures : A distance measure is a function on pairs of points in a space that satisfy certain axioms. The

distance between two points is 0 if the points are the same, but greater than 0 if the points are different. The distance

is symmetric; it does not matter in which order we consider the two points. A distance measure must satisfy the tri-

angle inequality: the distance between two points is never more than the sum of the distances between those points

and some third point.

✦ Euclidean Distance : The most common notion of distance is the Euclidean distance in an n -dimensional space. This

distance, sometimes called the L 2 -norm, is the square root of the sum of the squares of the differences between the

points in each dimension. Another distance suitable for Euclidean spaces, called Manhattan distance or the L 1 -norm

is the sum of the magnitudes of the differences between the points in each dimension.

✦ Jaccard Distance : One minus the Jaccard similarity is a distance measure, called the Jaccard distance.

✦ Cosine Distance : The angle between vectors in a vector space is the cosine distance measure. We can compute the

cosine of that angle by taking the dot product of the vectors and dividing by the lengths of the vectors.

✦ Edit Distance : This distance measure applies to a space of strings, and is the number of insertions and/or deletions

needed to convert one string into the other. The edit distance can also be computed as the sum of the lengths of the

strings minus twice the length of the longest common subsequence of the strings.

✦ Hamming Distance : This distance measure applies to a space of vectors. The Hamming distance between two vectors

is the number of positions in which the vectors differ.

✦ Generalized Locality-Sensitive Hashing : We may start with any collection of functions, such as the minhash func-

tions, that can render a decision as to whether or not a pair of items should be candidates for similarity checking. The

only constraint on these functions is that they provide a lower bound on the probability of saying “yes” if the dis-

tance (according to some distance measure) is below a given limit, and an upper bound on the probability of saying

“yes” if the distance is above another given limit. We can then increase the probability of saying “yes” for nearby

items and at the same time decrease the probability of saying “yes” for distant items to as great an extent as we wish,

by applying an AND construction and an OR construction.

✦ Random Hyperplanes and LSH for Cosine Distance : We can get a set of basis functions to start a generalized LSH

for the cosine distance measure by identifying each function with a list of randomly chosen vectors. We apply a

function to a given vector v by taking the dot product of v with each vector on the list. The result is a sketch consist-

Mining of Massive Datasets

Search WWH ::

Custom Search

Home