Database Reference
In-Depth Information
The gram counting order has the length bounding property as well.
Given vector
s
for string
s
in gram counting order, the length of
s
is
L
=1
s
[
]
q
+ 1, by the definition of
q
-grams. This implies that the
length of the strings in string interval [
s
i
,s
j
] is bounded in the interval
−
[
L
=1
B
l
− q
+1
,
L
=1
B
u
− q
+ 1]. This allows us to correctly answer
queries based on normalized, as well as standard edit distance.
8.3.2.3
Gram location order
In gram counting order the positional information of the
q
-grams is
simply discarded. Another string order, called
Gram Location Order
,
exploits the positions of the
q
-grams to improve edit distance based
pruning. The gram location order satisfies comparability, lower bound-
ing, and pairwise lower bounding, and hence supports all types of
queries but not normalized edit distance.
To conduct the transformation, the
q
-grams are extracted along
with their actual positions in the string. By hashing the
q
-grams to inte-
gers, each positional
q
-gram is represented by a vector of two entries,
the hash value of the
q
-gram and the position of the
q
-gram. We use a
hash value instead of the
idf
to avoid the updating issues related with
computing
idf
s when strings are added or deleted from the B-tree.
To avoid storing large sets of hash-value/position pairs in the inter-
mediate B-tree nodes, we sort the set elements based on the increas-
ing two-dimensional
z
-order value of each element (i.e., the
z
-order
value of pair (
h
s
,p
s
), where
h
s
is the hash value of the
q
-gram). Then,
for each set we preserve only the first
L
elements in the
z
-order, and
finally sort the elements in increasing order of positions. In the rest
we use
s
to denote the top-
L
positional
q
-grams for string
s
, i.e.,
s
=
, in which
h
i
is the
q
-gram hash value,
p
i
is the corresponding position of the
q
-gram in the original string
s
, and
p
1
≤ ··· ≤ p
L
.
Simply stated, we preserve only a pseudo-random subset of posi-
tional
q
-grams per string, and approximately compare strings based
on these subsets only (the pseudo-random ordering is imposed by the
hash function
h
and the
z
-order used for each vector component, which
(
h
1
,p
1
)
,...,
(
h
L
,p
L
)
{
}