Index Structures - Approximate String Processing

Database Reference

In-Depth Information

Table 5.2. An inverted representation

of the strings in Table 5.1 as inverted

lists. It is assumed that common stop

words and special characters have been

removed, stemming of words has occurred,

all tokens have been converted to lowercase,

and strings are considered as sequences

of tokens (resulting in string-identifier/

token-position pairs).

Token

Inverted List

child

( s 1 , 1) , ( s 2 , 4) , ( s 3 , 3) , ( s 4 , 2)

food

( s 1 , 2) , ( s 3 , 1)

found

( s 4 , 3)

fund

( s 1 , 3)

international

( s 4 , 1)

one

( s 2 , 1)

laptop

( s 2 , 2)

unicef

( s 5 , 1)

common stop words have been eliminated. The two representations are

equivalent. With the first representation one can eciently determine

all tokens contained in a given string (by simply scanning the set cor-

responding to that string), while with the second representation one

can eciently identify all strings containing a specific token (by simply

scanning the set corresponding to that token).

The advantage of the inverted index is that it can be used to answer

union and intersection queries very eciently. Given a query string, rep-

resented as a set of tokens, one can find all the strings containing at least

one of the query tokens by simply taking the union of inverted lists cor-

responding to these tokens. For example, given query q =

{

'Children',

'International'

, the union of lists 'child' and 'international' results in

string identifiers s 1 , s 2 , s 3 , s 4 . To find all the strings containing all of the

query tokens one has to take the intersection of the lists corresponding

to the query tokens (and hence identify all data strings that are super

sets of the query string). For example the intersection of lists 'child'

and 'international' results is string identifier s 4 . Hybrid list intersec-

tion strategies can be used to evaluate more involved string similarity

functions, as explained in detail in Section 6.

}

Approximate String Processing

Search WWH ::

Custom Search

Home