Finding Similar Items - Mining of Massive Datasets

Databases Reference

In-Depth Information

3. They must be e cient, in two ways:

(a) They must be able to identify candidate pairs in time much less

than the time it takes to look at all pairs. For example, minhash

functions have this capability, since we can hash sets to minhash

values in time proportional to the size of the data, rather than the

square of the number of sets in the data. Since sets with common

values are colocated in a bucket, we have implicitly produced the

candidate pairs for a single minhash function in time much less than

the number of pairs of sets.

(b) They must be combinable to build functions that are better at avoid-

ing false positives and negatives, and the combined functions must

also take time that is much less than the number of pairs. For ex-

ample, the banding technique of Section 3.4.1 takes single minhash

functions, which satisfy condition 3a but do not, by themselves have

the S-curve behavior we want, and produces from a number of min-

hash functions a combined function that has the S-curve shape.

Our first step is to define “locality-sensitive functions” generally. We then

see how the idea can be applied in several applications. Finally, we discuss

how to apply the theory to arbitrary data with either a cosine distance or a

Euclidean distance measure.

3.6.1 Locality-Sensitive Functions

For the purposes of this section, we shall consider functions that take two items

and render a decision about whether these items should be a candidate pair.

In many cases, the function f will “hash” items, and the decision will be based

on whether or not the result is equal. Because it is convenient to use the

notation f (x) = f (y) to mean that f (x, y) is “yes; make x and y a candidate

pair,” we shall use f (x) = f (y) as a shorthand with this meaning. We also use

f (x) = f (y) to mean “do not make x and y a candidate pair unless some other

function concludes we should do so.”

A collection of functions of this form will be called a family of functions.

For example, the family of minhash functions, each based on one of the possible

permutations of rows of a characteristic matrix, form a family.

Let d 1 < d 2 be two distances according to some distance measure d. A

family F of functions is said to be (d 1 , d 2 , p 1 , p 2 )-sensitive if for every f in F:

1. If d(x, y)≤d 1 , then the probability that f (x) = f (y) is at least p 1 .

2. If d(x, y)≥d 2 , then the probability that f (x) = f (y) is at most p 2 .

Figure 3.9 illustrates what we expect about the probability that a given

function in a (d 1 , d 2 , p 1 , p 2 )-sensitive family will declare two items to be a can-

didate pair. Notice that we say nothing about what happens when the distance

Search WWH ::

Custom Search

Home