Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

what follows, we shall refer to the first of these buckets as “the” bucket for f and ignore the

buckets that are required to be singletons.

If we want to solve the many-one problem, we can use many functions from the family

F and precompute their buckets of fingerprints to which they answer “yes.” Then, given a

new fingerprint that we want to match, we determine which of these buckets it belongs to

and compare it with all the fingerprints found in any of those buckets. To solve the many-

many problem, we compute the buckets for each of the functions and compare all finger-

prints in each of the buckets.

Let us consider how many functions we need to get a reasonable probability of catching

a match, without having to compare the fingerprint on the gun with each of the millions of

fingerprints in the database. First, the probability that two fingerprints from different fin-

gers would be in the bucket for a function f in F is (0 . 2) 6 = 0 . 000064. The reason is that

they will both go into the bucket only if they each have a minutia in each of the three grid

points associated with f , and the probability of each of those six independent events is 0.2.

Now, consider the probability that two fingerprints from the same finger wind up in the

bucket for f . The probability that the first fingerprint has minutiae in each of the three

squares belonging to f is (0 . 2) 3 = 0 . 008. However, if it does, then the probability is (0 . 8) 3 =

0 . 512 that the other fingerprint will as well. Thus, if the fingerprints are from the same fin-

ger, there is a 0 . 008 × 0 . 512 = 0 . 004096 probability that they will both be in the bucket of

f . That is not much; it is about one in 200. However, if we use many functions from F , but

not too many, then we can get a good probability of matching fingerprints from the same

finger while not having too many false positives - fingerprints that must be considered but

do not match.

EXAMPLE 3.22 For a specific example, let us suppose that we use 1024 functions chosen

randomly from F . Next, we shall construct a new family F 1 by performing a 1024-way OR

on F . Then the probability that F 1 will put fingerprints from the same finger together in

at least one bucket is 1 − (1 − 0 . 004096) 1024 = 0 . 985. On the other hand, the probability

that two fingerprints from different fingers will be placed in the same bucket is (1 − (1 −

0 . 000064) 1024 = 0 . 063. That is, we get about 1.5% false negatives and about 6.3% false

positives.

□

The result of Example 3.22 is not the best we can do. While it offers only a 1.5% chance

that we shall fail to identify the fingerprint on the gun, it does force us to look at 6.3% of

the entire database. Increasing the number of functions from F will increase the number of

false positives, with only a small benefit of reducing the number of false negatives below

1.5%. On the other hand, we can also use the AND construction, and in so doing, we can

greatly reduce the probability of a false positive, while making only a small increase in the

Search WWH ::

Custom Search

Home