Database Reference
In-Depth Information
as many times as the tokens it consists of. In most practical cases,
duplicating the strings is infeasible, and for that reason the lists usu-
ally contain only the string identifiers. The only advantage of stor-
ing the actual strings in the lists is that it eliminates the need of at
least one extra read operation for retrieving the actual string when it
is needed for further processing (e.g., reporting it to the user or for
computing a ranking weight). In applications with critical performance
requirements, duplicating strings might be acceptable (this is a typi-
cal performance/space tradeoff). In what follows we assume that only
the string identifiers are contained in the lists, and, in order to avoid
confusion, strings are denoted with lowercase characters (e.g., s ) and
string identifiers using double dots (e.g., s ).
Notice that an inverted index on frequency-sets forfeits the posi-
tional information by replacing it with frequency information (and as
a result reducing the size of the inverted index by collapsing multiple
entries of a given string identifier within an inverted list into one entry).
Finally, an inverted index on sets forfeits the frequency information as
well. Furthermore, one can extend the information stored with each
inverted list entry as well. For example, one can include the position of
the string within a document or the column and table containing the
string in a relational database. The addressing granularity stored along
with each inverted list entry comes at the cost of additional storage
space per list. Alternatively, one can retrieve this information by using
the string identifier to query a separate index for retrieving the actual
string and other associated information, on demand.
The inverted representation of the strings in Table 5.1 is shown
in Table 5.2 after stemming has been performed on the tokens and
Table 5.1. A collection of strings represented as a collection
of sets of tokens.
Tokens
String
1
2
3
4
s 1
'Children's'
'Food'
'Fund'
s 2
'One'
'Laptop'
'per'
'Child'
s 3
'Feed'
'the'
'Children'
s 4
'International'
'Children's'
'Found'
s 5
'UNICEF'
 
Search WWH ::




Custom Search