Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

Our first step is to define “locality-sensitive functions” generally. We then see how the

idea can be applied in several applications. Finally, we discuss how to apply the theory to

arbitrary data with either a cosine distance or a Euclidean distance measure.

3.6.1

Locality-Sensitive Functions

For the purposes of this section, we shall consider functions that take two items and render

a decision about whether these items should be a candidate pair. In many cases, the func-

tion f will “hash” items, and the decision will be based on whether or not the result is equal.

Because it is convenient to use the notation f ( x ) = f ( y ) to mean that f ( x, y ) is “yes; make x

and y a candidate pair,” we shall use f ( x ) = f ( y ) as a shorthand with this meaning. We also

use f ( x ) ≠ f ( y ) to mean “do not make x and y a candidate pair unless some other function

concludes we should do so.”

A collection of functions of this form will be called a family of functions. For example,

the family of minhash functions, each based on one of the possible permutations of rows of

a characteristic matrix, form a family.

Let d 1 < d 2 be two distances according to some distance measure d . A family F of func-

tions is said to be ( d 1 , d 2 , p 1 , p 2 ) -sensitive if for every f in F :

(1) If d ( x, y ) ≤ d 1 , then the probability that f ( x ) = f ( y ) is at least p 1 .

(2) If d ( x, y ) ≥ d 2 , then the probability that f ( x ) = f ( y ) is at most p 2 .

Figure 3.9 illustrates what we expect about the probability that a given function in a ( d 1 ,

d 2 , p 1 , p 2 )-sensitive family will declare two items to be a candidate pair. Notice that we say

nothing about what happens when the distance between the items is strictly between d 1 and

d 2 , but we can make d 1 and d 2 as close as we wish. The penalty is that typically p 1 and p 2

are then close as well. As we shall see, it is possible to drive p 1 and p 2 apart while keeping

d 1 and d 2 fixed.

Figure 3.9 Behavior of a ( d 1 , d 2 , p 1 , p 2 )-sensitive function

Search WWH ::

Custom Search

Home