Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

the probe string. Determine the length of the prefix that must be considered, that is, ⌊ (1 −

J ) L s ⌋ + 1. For each symbol appearing in one of the prefix positions of s , we look in the in-

dex bucket for that symbol, and we compare s with all the strings appearing in that bucket.

If we want to solve the many-many problem, start with an empty database of strings and

indexes. For each set S , we treat S as a new set for the many-one problem. We convert S

to a string s , which we treat as a probe string in the many-one problem. However, after we

examine an index bucket, we also add s to that bucket, so s will be compared with later

strings that could be matches.

3.9.5

Using Position Information

Consider the strings s = acdefghijk and t = bcdefghijk , and assume J = 0 . 9. Since

both strings are of length 10, they are indexed under their first two symbols. Thus, s is in-

dexed under a and c , while t is indexed under b and c . Whichever is added last will find

the other in the bucket for c , and they will be compared. However, since c is the second

symbol of both, we know there will be two symbols, a and b in this case, that are in the

union of the two sets but not in the intersection. Indeed, even though s and t are identical

from c to the end, their intersection is 9 symbols and their union is 11; thus SIM( s, t ) =

9/11, which is less than 0.9.

If we build our index based not only on the symbol, but on the position of the symbol

within the string, we could avoid comparing s and t above. That is, let our index have a

bucket for each pair ( x, i ), containing the strings that have symbol x in position i of their

prefix. Given a string s , and assuming J is the minimum desired Jaccard distance, we look

at the prefix of s , that is, the positions 1 through ⌊ (1 − J ) L s ⌋ + 1. If the symbol in position i

of the prefix is x , add s to the index bucket for ( x, i ).

Now consider s as a probe string. With what buckets must it be compared? We shall visit

the symbols of the prefix of s from the left, and we shall take advantage of the fact that

we only need to find a possible matching string t if none of the previous buckets we have

examined for matches held t . That is, we only need to find a candidate match once. Thus, if

we find that the i th symbol of s is x , then we need look in the bucket ( x, j ) for certain small

values of j .

To compute the upper bound on j , suppose t is a string none of whose first j − 1 symbols

matched anything in s , but the i th symbol of s is the same as the j th symbol of t . The highest

value of SIM( s, t ) occurs if s and t are identical beyond their i th and j th symbols, respect-

ively, as suggested by Fig. 3.14 . If that is the case, the size of their intersection is L s − i

+ 1, since that is the number of symbols of s that could possibly be in t . The size of their

union is at least L s + j − 1. That is, s surely contributes L s symbols to the union, and there

Search WWH ::

Custom Search

Home