Database Reference
In-Depth Information
identifiers or increasing/decreasing order of L p -norms, depending on
the index being built.
Assume that the input datasets are too large to fit in memory. If
the sort order is according to string identifiers then an external sort
algorithm can be used to sort the data directly. Then, the token lists of
the inverted index are simply populated incrementally, as new strings
are processed in string identifier order.
On the other hand, if the sort order is according to L p -norms, then
the algorithm has to compute the norms of each string first. Depending
on the nature of token weights used, computing the norms might require
at least one additional scan of the data. Consider for example idf based
weights. In order to compute idf based L p -norms, the idf of each token
has to be computed first. Computing idf s requires scanning the data
to compute the document frequency of each token. Once the idf s have
been computed, an external sort algorithm can be used to sort the data.
An alternative approach is to avoid sorting the data initially.
Instead, we can incrementally populate the token lists as strings are
processed and tokenized out of order. After all strings are processed we
can sort each token list independently. The drawback of this approach
is that, provided that the cardinality of the token universe Λ is very
large, we might have to actively manage a very large number of token
lists, which can be very expensive. On the other hand, if the cardinality
of Λ is expected to be small then this simplistic approach can be faster
than external sorting of the input dataset. Notice also that if token
lists are maintained as B-trees, out of order processing of strings will
immediately result in properly sorted token lists. Nevertheless, in this
case every single insertion of a string entry in any token list results in a
random access. A better approach is, once again, to use external sorting
to sort the input dataset first, and bulk-load the B-trees corresponding
to the token lists as a final step.
6.7
Index Updates
Typically, inverted indexes are used for mostly static data. More often
than not, token lists are represented either by simple arrays in main
memory, or flat files in secondary storage, since these representations
Search WWH ::




Custom Search