Database Reference
In-Depth Information
Table 5.2. An inverted representation
of the strings in Table 5.1 as inverted
lists. It is assumed that common stop
words and special characters have been
removed, stemming of words has occurred,
all tokens have been converted to lowercase,
and strings are considered as sequences
of tokens (resulting in string-identifier/
token-position pairs).
Token
Inverted List
child
( s 1 , 1) , ( s 2 , 4) , ( s 3 , 3) , ( s 4 , 2)
food
( s 1 , 2) , ( s 3 , 1)
found
( s 4 , 3)
fund
( s 1 , 3)
international
( s 4 , 1)
one
( s 2 , 1)
laptop
( s 2 , 2)
unicef
( s 5 , 1)
common stop words have been eliminated. The two representations are
equivalent. With the first representation one can eciently determine
all tokens contained in a given string (by simply scanning the set cor-
responding to that string), while with the second representation one
can eciently identify all strings containing a specific token (by simply
scanning the set corresponding to that token).
The advantage of the inverted index is that it can be used to answer
union and intersection queries very eciently. Given a query string, rep-
resented as a set of tokens, one can find all the strings containing at least
one of the query tokens by simply taking the union of inverted lists cor-
responding to these tokens. For example, given query q =
{
'Children',
'International'
, the union of lists 'child' and 'international' results in
string identifiers s 1 , s 2 , s 3 , s 4 . To find all the strings containing all of the
query tokens one has to take the intersection of the lists corresponding
to the query tokens (and hence identify all data strings that are super
sets of the query string). For example the intersection of lists 'child'
and 'international' results is string identifier s 4 . Hybrid list intersec-
tion strategies can be used to evaluate more involved string similarity
functions, as explained in detail in Section 6.
}
 
Search WWH ::




Custom Search