Database Reference
In-Depth Information
In what follows, we shall assume all strings represent sets in the manner just described.
Thus, we shall talk about the Jaccard similarity of strings, when strictly speaking we mean
the similarity of the sets that the strings represent. Also, we shall talk of the length of a
string, as a surrogate for the number of elements in the set that the string represents.
Note that the documents discussed in Section 3.9.1 do not exactly match this model, even
though we can see documents as strings. To fit the model, we would shingle the documents,
assign an order to the shingles, and represent each document by its list of shingles in the
selected order.
A Better Ordering for Symbols
Instead of using the obvious order for elements of the universal set, e.g., lexico-graphic order for shingles, we can
order symbols rarest first. That is, determine how many times each element appears in the collection of sets, and order
them by this count, lowest first. The advantage of doing so is that the symbols in prefixes will tend to be rare. Thus,
they will cause that string to be placed in index buckets that have relatively few members. Then, when we need to
examine a string for possible matches, we shall find few other strings that are candidates for comparison.
3.9.3
Length-Based Filtering
The simplest way to exploit the string representation of Section 3.9.2 is to sort the strings
by length. Then, each string s is compared with those strings t that follow s in the list, but
are not too long. Suppose the upper bound on Jaccard distance between two strings is J . For
any string x , denote its length by L x . Note that L s L t . The intersection of the sets represen-
ted by s and t cannot have more than L s members, while their union has at least L t members.
Thus, the Jaccard similarity of s and t , which we denote SIM( s, t ), is at most L s / L t . That is,
in order for s and t to require comparison, it must be that J L s / L t , or equivalently, L t
L s / J .
EXAMPLE 3.25 Suppose that s is a string of length 9, and we are looking for strings with
at least 0.9 Jaccard similarity. Then we have only to compare s with strings following it in
the length-based sorted order that have length at most 9/0 . 9 = 10. That is, we compare s
with those strings of length 9 that follow it in order, and all strings of length 10. We have
no need to compare s with any other string.
Suppose the length of s were 8 instead. Then s would be compared with following strings
of length up to 8/0 . 9 = 8 . 89. That is, a string of length 9 would be too long to have a Jaccard
similarity of 0.9 with s , so we only have to compare s with the strings that have length 8
but follow it in the sorted order.
Search WWH ::




Custom Search