Database Reference
In-Depth Information
The gram counting order has the length bounding property as well.
Given vector s for string s in gram counting order, the length of s is
L =1 s [ ]
q + 1, by the definition of q -grams. This implies that the
length of the strings in string interval [ s i ,s j ] is bounded in the interval
[ L =1 B l
− q +1 , L =1 B u − q + 1]. This allows us to correctly answer
queries based on normalized, as well as standard edit distance.
8.3.2.3
Gram location order
In gram counting order the positional information of the q -grams is
simply discarded. Another string order, called Gram Location Order ,
exploits the positions of the q -grams to improve edit distance based
pruning. The gram location order satisfies comparability, lower bound-
ing, and pairwise lower bounding, and hence supports all types of
queries but not normalized edit distance.
To conduct the transformation, the q -grams are extracted along
with their actual positions in the string. By hashing the q -grams to inte-
gers, each positional q -gram is represented by a vector of two entries,
the hash value of the q -gram and the position of the q -gram. We use a
hash value instead of the idf to avoid the updating issues related with
computing idf s when strings are added or deleted from the B-tree.
To avoid storing large sets of hash-value/position pairs in the inter-
mediate B-tree nodes, we sort the set elements based on the increas-
ing two-dimensional z -order value of each element (i.e., the z -order
value of pair ( h s ,p s ), where h s is the hash value of the q -gram). Then,
for each set we preserve only the first L elements in the z -order, and
finally sort the elements in increasing order of positions. In the rest
we use s to denote the top- L positional q -grams for string s , i.e.,
s =
, in which h i is the q -gram hash value, p i
is the corresponding position of the q -gram in the original string s , and
p 1 ≤ ··· ≤ p L .
Simply stated, we preserve only a pseudo-random subset of posi-
tional q -grams per string, and approximately compare strings based
on these subsets only (the pseudo-random ordering is imposed by the
hash function h and the z -order used for each vector component, which
( h 1 ,p 1 ) ,..., ( h L ,p L )
{
}
Search WWH ::




Custom Search