Database Reference
In-Depth Information
number of q -grams within a string. We use a hash function to map q -
grams to a set of buckets of fixed cardinality. We show that the Gram
Counting Order has the properties of comparability, lower bounding,
pairwise lower bounding, and length bounding. Therefore, it can be
used to index on edit distance and normalized edit distance for range
and top- k selection queries, and join queries.
A q -gram set can be intuitively represented as a vector in a high
dimensional space where each dimension corresponds to a distinct q -
gram (i.e., the vector space model). This solution, however, incurs high
storage cost. To compress the information on the vector space, we use a
hash function to map each q -gram to a set of L buckets, and count the
number of q -grams in each bucket. Thus, the q -gram set is transformed
into a vector of L non-negative integers.
The q -gram mismatch lemma (Lemma 8.4) states that the edit dis-
tance between two strings s, r is no smaller than
max | Q ( s ) \ Q ( r ) |
q
.
, | Q ( r ) \ Q ( s ) |
q
E
( s, r )
(8.3)
Here, Q ( s ) \ Q ( r ) is the set of q -grams in Q ( s ) and not Q ( r ), and vice
versa. After mapping strings from gram space to the bucket space, a
new lower bound holds. If s and r are the L -dimensional bucket vector
representations of s and r , respectively, the edit distance between s and
r is no smaller than
s [ ] >r [ ]
,
r [ ] >s [ ]
s [ ]
r [ ]
r [ ]
s [ ]
max
(8.4)
q
q
L . Stated simply, the edit distance should be at least as
large as the largest difference in the q -gram counts between any pair of
corresponding buckets in the bucket representation (correcting for the
fact that one edit operation can change up to qq -grams).
To achieve a tighter lower bound we apply z -order on the q -gram
counting vectors to cluster strings with similar vectors as best as pos-
sible in the one-dimensional B-tree space. Given a vector s , z -order
interleaves the bits from all vector components in a round robin fash-
ion. For example, let the binary values of the vector components be
'11', '10', '01', '11', for L = 4. Then, the z -order value of the vector is
for 1
Search WWH ::




Custom Search