Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

3.9.4

Prefix Indexing

In addition to length, there are several other features of strings that can be exploited to limit

the number of comparisons that must be made to identify all pairs of similar strings. The

simplest of these options is to create an index for each symbol; recall a symbol of a string

is any one of the elements of the universal set. For each string s , we select a prefix of s

consisting of the first p symbols of s . How large p must be depends on L s and J , the lower

bound on Jaccard distance. We add string s to the index for each of its first p symbols.

In effect, the index for each symbol becomes a bucket of strings that must be compared.

We must be certain that any other string t such that SIM( s, t ) ≥ J will have at least one sym-

bol in its prefix that also appears in the prefix of s .

Suppose not; rather SIM( s, t ) ≥ J , but t has none of the first p symbols of s . Then the

highest Jaccard similarity that s and t can have occurs when t is a suffix of s , consisting of

everything but the first p symbols of s . The Jaccard similarity of s and t would then be ( L s

− p )/ L s . To be sure that we do not have to compare s with t , we must be certain that J >

( L s − p )/ L s . That is, p must be at least ⌊ (1 − J ) L s ⌋ + 1. Of course we want p to be as small

as possible, so we do not index string s in more buckets than we need to. Thus, we shall

hereafter take p = ⌊ (1 − J ) L s ⌋ + 1 to be the length of the prefix that gets indexed.

EXAMPLE 3.26 Suppose J = 0 . 9. If L s = 9, then p = ⌊ 0 . 1×9 ⌋ +1 = ⌊ 0 . 9 ⌋ +1 = 1. That is, we

need to index s under only its first symbol. Any string t that does not have the first symbol

of s in a position such that t is indexed by that symbol will have Jaccard similarity with s

that is less than 0.9. Suppose s is bcdefghij . Then s is indexed under b only. Suppose t

does not begin with b . There are two cases to consider.

(1) If t begins with a and SIM( s, t ) ≥ 0 . 9, then it can only be that t is abcdefghij . But

if that is the case, t will be indexed under both a and b . The reason is that L t = 10, so t

will be indexed under the symbols of its prefix of length ⌊ 0 . 1 × 10 ⌋ + 1 = 2.

(2) If t begins with c or a later letter, then the maximum value of SIM( s, t ) occurs when t

is cdefghij . But then SIM( s, t ) = 8/9 < 0 . 9.

In general, with J = 0 . 9, strings of length up to 9 are indexed by their first symbol, strings

of lengths 10-19 are indexed under their first two symbols, strings of length 20-29 are in-

dexed under their first three symbols, and so on.

□

We can use the indexing scheme in two ways, depending on whether we are trying to

solve the many-many problem or a many-one problem; recall the distinction was intro-

duced in Section 3.8.4 . For the many-one problem, we create the index for the entire data-

base. To query for matches to a new set S , we convert that set to a string s , which we call

Search WWH ::

Custom Search

Home