For that purpose, we sort token lists in increasing (or decreasing) order of the $L_1$-norms of strings, rather than in string identifier order. (Without loss of generality, in the rest we assume that the inverted lists are always sorted in increasing order of $L_p$-norms.) Using Lemma 6.1, we can directly skip over all candidate strings with $L_1$-norm $\|s\|_1 < \theta \|v\|_1$ (where $\|v\|_1$ is the $L_1$-norm of the query) and stop scanning a list whenever we encounter the first candidate string with $L_1$-norm $\|s\|_1 > \|v\|_1 / \theta$.
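To illustrate, a minimal sketch of this norm-based pruning appears below; the list layout (pairs of string identifier and precomputed $L_1$-norm, sorted by norm) and the function name are assumptions made for the example.

```python
from bisect import bisect_left

def scan_norm_sorted_list(token_list, query_norm, theta):
    """Scan one inverted list sorted by increasing L1-norm.

    token_list : list of (string_id, l1_norm) pairs, sorted by l1_norm
                 (hypothetical layout; real lists also carry token weights).
    query_norm : L1-norm of the query, ||v||_1.
    theta      : similarity threshold in (0, 1].
    """
    lower = theta * query_norm    # Lemma 6.1: candidates below this norm can be skipped
    upper = query_norm / theta    # ... and scanning can stop past this norm

    # Jump directly to the first entry with l1_norm >= lower.
    norms = [norm for _, norm in token_list]
    start = bisect_left(norms, lower)

    for string_id, norm in token_list[start:]:
        if norm > upper:          # lists are norm-sorted, so all later entries are pruned too
            break
        yield string_id, norm
```

With norm-sorted lists, both the initial skip and the early stop touch only entries whose norms fall inside the admissible range.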
Given that the lists are not sorted in increasing string identifier order, the obvious choice is to use the classic threshold-based algorithms to compute the similarity of each string. Threshold algorithms utilize special terminating conditions that enable processing to stop before exhaustively scanning all token lists. This is easy to show if the similarity function is a monotone aggregate function. Let $\alpha_i(s, v) = W(\lambda_i, s) / \max(\|s\|_1, \|v\|_1)$ be the partial similarity of data string $s$ and query token $\lambda_i$. Then, $N(s, v) = \sum_{\lambda_i \in s \cap v} \alpha_i(s, v)$. It can be shown that $N(s, v)$ is a monotone aggregate function, i.e., $N(s, v) \le N(s', v)$ if $\forall i \in [1, m]: \alpha_i(s, v) \le \alpha_i(s', v)$, where $\lambda_1, \ldots, \lambda_m$ are the query tokens. The proof is straightforward.
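As a concrete reading of these definitions, the sketch below sums the partial similarities over the tokens shared by $s$ and $v$; the weight function signature `weight(token, s)` and the representation of strings as token sets are assumptions made for illustration.

```python
def partial_similarity(token, s_tokens, s_norm, v_norm, weight):
    """alpha_i(s, v) = W(lambda_i, s) / max(||s||_1, ||v||_1)."""
    return weight(token, s_tokens) / max(s_norm, v_norm)

def normalized_similarity(s_tokens, v_tokens, s_norm, v_norm, weight):
    """N(s, v) = sum of alpha_i(s, v) over tokens lambda_i in s ∩ v."""
    return sum(
        partial_similarity(tok, s_tokens, s_norm, v_norm, weight)
        for tok in s_tokens & v_tokens
    )
```

Since the total is a plain sum of non-negative partial similarities, increasing any single $\alpha_i$ can only increase $N(s, v)$, which is exactly the monotonicity that threshold algorithms exploit.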
There are three threshold algorithm flavors. The first scans lists sequentially and in a round-robin fashion and computes the similarity of strings incrementally. The second uses random accesses to compute similarity aggressively every time a new string identifier is encountered in one of the lists. The third uses combination strategies of both sequential and random accesses.
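The following is a rough sketch of the first, sequential round-robin flavor; the list format and function name are assumptions, and the per-list upper bounds that a complete threshold algorithm uses to terminate early are omitted for brevity.

```python
from collections import defaultdict

def round_robin_accumulate(lists, theta):
    """Round-robin scan over norm-sorted token lists (simplified sketch).

    lists : one list per query token, each containing (string_id, alpha_i) pairs,
            i.e., the partial similarity contributed by that token.
    theta : similarity threshold.

    Partially computed similarities are kept in a hash table keyed by string id;
    a full implementation would also track upper bounds on the unseen entries in
    order to stop before the lists are exhausted.
    """
    accumulated = defaultdict(float)          # book-keeping: partial similarities
    cursors = [iter(lst) for lst in lists]
    active = set(range(len(lists)))

    while active:
        for i in list(active):                # one entry from each active list per round
            entry = next(cursors[i], None)
            if entry is None:
                active.discard(i)             # this list is exhausted
                continue
            string_id, alpha = entry
            accumulated[string_id] += alpha   # incremental similarity computation

    return [sid for sid, sim in accumulated.items() if sim >= theta]
```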
Since the sequential round-robin algorithm computes similarity incrementally, it has to temporarily store in main memory information about strings whose similarity has only been partially computed. Hence, the algorithm has a high book-keeping cost, given that the candidate set needs to be maintained as a hash table organized by string identifiers. The aggressive random-access-based algorithm computes the similarity of strings in one step and hence does not need to store any information in main memory. It therefore has a very small book-keeping cost, but on the other hand it has to perform a large number of random accesses to achieve this. This could be a drawback on traditional storage devices (like hard drives) but unimportant on modern solid-state drives.
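By contrast, a sketch of the aggressive random-access flavor might look as follows; the per-list lookup tables that stand in for random accesses, and the absence of a terminating condition, are simplifications made for the example.

```python
def random_access_flavor(lists, theta):
    """Aggressive random-access flavor (simplified sketch).

    lists : one dict per query token mapping string_id -> alpha_i
            (a stand-in for random accesses into the inverted lists).
    theta : similarity threshold.

    Whenever a string id is encountered for the first time, its similarity is
    computed in one step by probing every other list, so no partially computed
    similarities have to be kept in main memory.
    """
    results = []
    seen = set()                              # only avoids re-scoring a string twice
    for current in lists:
        for string_id in current:
            if string_id in seen:
                continue
            seen.add(string_id)
            # One random access per list to assemble the complete similarity.
            similarity = sum(lst.get(string_id, 0.0) for lst in lists)
            if similarity >= theta:
                results.append(string_id)
    return results
```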