Database Reference
In-Depth Information
for each coordinate corresponding to a token λ i
s and 0 elsewhere
(i.e., the vector space model). The basic partitioning signatures work
only for unit token weights, in which case the string vectors contain 1
in each coordinate corresponding to all tokens λ i
s and 0 elsewhere.
Extensions for uniform (other than unit) and non-uniform weights will
be discussed shortly.
Let the tokens in Λ be ordered using the permutation function h .
In other words the token corresponding to the i -th vector coordinate
is determined by h . The Hamming distance between two vectors is the
number of dimensions (i.e., tokens) in which the two vectors differ,
that is
Definition 7.6 (Hamming Distance). Given two
|
Λ
|
-dimensional
b 1 ,...,b s
b 1 ,...,b r
Boolean vectors s =
{
| Λ | }
, r =
{
| Λ | }
, the Hamming dis-
tance is defined as
( s, r )= | Λ |
I ( b i = b i ) ,
H
i =1
where I ( b i = b i ) is an indicator variable that returns 1 if b i = b i and 0
otherwise.
The definition can be straightforwardly extended for sequences and
frequency-sets.
Assume that we partition the vectors into θ + 1 groups of consecu-
tive dimensions from the permuted token universe h (Λ). If two vectors
are within Hamming distance θ then by the pigeonhole principle the two
vectors must completely agree in at least one partition. Given a string
s we can create a simple string signature based on the pigeonhole prin-
ciple by hashing the bit-vectors corresponding to each one of the θ +1
partitions using a hash function h :
[0 , 2 b )( b dependent on the
desired accuracy), and maintaining the resulting θ + 1 hash values. We
denote this signature with PG 1 . An example is shown in Figure 7.3.
Two strings are within Hamming distance θ if and only if their
hash signatures agree in at least one hash value corresponding to the
same partition . For convenience, we can concatenate the partition num-
ber along with the bit-vector corresponding to that partition and hash
}
{
0 , 1
 
Search WWH ::




Custom Search