Database Reference
In-Depth Information
for each coordinate corresponding to a token
λ
i
∈
s
and 0 elsewhere
(i.e., the vector space model). The basic partitioning signatures work
only for unit token weights, in which case the string vectors contain 1
in each coordinate corresponding to all tokens
λ
i
∈
s
and 0 elsewhere.
Extensions for uniform (other than unit) and non-uniform weights will
be discussed shortly.
Let the tokens in Λ be ordered using the permutation function
h
.
In other words the token corresponding to the
i
-th vector coordinate
is determined by
h
. The Hamming distance between two vectors is the
number of dimensions (i.e., tokens) in which the two vectors differ,
that is
Definition 7.6 (Hamming Distance).
Given two
|
Λ
|
-dimensional
b
1
,...,b
s
b
1
,...,b
r
Boolean vectors
s
=
{
|
Λ
|
}
,
r
=
{
|
Λ
|
}
, the Hamming dis-
tance is defined as
(
s, r
)=
|
Λ
|
I
(
b
i
=
b
i
)
,
H
i
=1
where
I
(
b
i
=
b
i
) is an indicator variable that returns 1 if
b
i
=
b
i
and 0
otherwise.
The definition can be straightforwardly extended for sequences and
frequency-sets.
Assume that we partition the vectors into
θ
+ 1 groups of consecu-
tive dimensions from the permuted token universe
h
(Λ). If two vectors
are within Hamming distance
θ
then by the pigeonhole principle the two
vectors must completely agree in at least one partition. Given a string
s
we can create a simple string signature based on the pigeonhole prin-
ciple by hashing the bit-vectors corresponding to each one of the
θ
+1
partitions using a hash function
h
:
[0
,
2
b
)(
b
dependent on the
desired accuracy), and maintaining the resulting
θ
+ 1 hash values. We
denote this signature with
PG
1
. An example is shown in Figure 7.3.
Two strings are within Hamming distance
θ
if and only if their
hash signatures agree in at least one hash value
corresponding to the
same partition
. For convenience, we can concatenate the partition num-
ber along with the bit-vector corresponding to that partition and hash
}
∗
⇒
{
0
,
1