Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

for each coordinate corresponding to a token λ i ∈

s and 0 elsewhere

(i.e., the vector space model). The basic partitioning signatures work

only for unit token weights, in which case the string vectors contain 1

in each coordinate corresponding to all tokens λ i ∈

s and 0 elsewhere.

Extensions for uniform (other than unit) and non-uniform weights will

be discussed shortly.

Let the tokens in Λ be ordered using the permutation function h .

In other words the token corresponding to the i -th vector coordinate

is determined by h . The Hamming distance between two vectors is the

number of dimensions (i.e., tokens) in which the two vectors differ,

that is

Definition 7.6 (Hamming Distance). Given two

|

Λ

|

-dimensional

b 1 ,...,b s

b 1 ,...,b r

Boolean vectors s =

{

| Λ | }

, r =

{

| Λ | }

, the Hamming dis-

tance is defined as

( s, r )= | Λ |

I ( b i = b i ) ,

H

i =1

where I ( b i = b i ) is an indicator variable that returns 1 if b i = b i and 0

otherwise.

The definition can be straightforwardly extended for sequences and

frequency-sets.

Assume that we partition the vectors into θ + 1 groups of consecu-

tive dimensions from the permuted token universe h (Λ). If two vectors

are within Hamming distance θ then by the pigeonhole principle the two

vectors must completely agree in at least one partition. Given a string

s we can create a simple string signature based on the pigeonhole prin-

ciple by hashing the bit-vectors corresponding to each one of the θ +1

partitions using a hash function h :

[0 , 2 b )( b dependent on the

desired accuracy), and maintaining the resulting θ + 1 hash values. We

denote this signature with PG 1 . An example is shown in Figure 7.3.

Two strings are within Hamming distance θ if and only if their

hash signatures agree in at least one hash value corresponding to the

same partition . For convenience, we can concatenate the partition num-

ber along with the bit-vector corresponding to that partition and hash

} ∗ ⇒

{

0 , 1

Approximate String Processing

Search WWH ::

Custom Search

Home