Database Reference
In-Depth Information
will consider the benefit of query time saved.
An exact estimate of inverted index size is dicult in the face of index
compression techniques (39). The posting list for an atype a (or a token in
general) has corpusCount ( a ) entries in it, so as a first approximation, it takes
space proportional to corpusCount ( a ). Therefore, if subset R is indexed, the
space needed can be approximated as
a∈R corpusCount ( a ) .
(10.8)
5.E+09
4.E+09
3.E+09
2.E+09
1.E+09
0.E+00
0.0E+00 5.0E+08 1.0E+09 1.5E+09 2.0E+09 2.5E+09
Estimated Index Size
FIGURE 10.22 : a∈R corpusCount ( a ) is a very good predictor of the size
of the atype subset index. (Root atypes are not indexed.)
Figure 10.22 shows that this crude approximation is surprisingly accurate.
This is probably because, averaged over many atypes, index compression
affects disk space by a fairly stable and uniform factor.
10.4.4 Query Time Bloat Model
Next we turn to the considerably more dicult task of estimating the factor
by which query execution slows down because only R ,not A ,thesetofall
atypes, has been indexed. This is dicult because, in general, the estimate
will depend on co-occurrence statistics between all possible atypes and all
possible words. In traditional relational database query optimization, where
the number of tables and attributes is modest, estimating multidimensional
“selectivity” of select and join predicates is a challenging problem (20).
Our testbed has over a million distinct tokens and some 18000 atypes
in A . Therefore, capturing correlations with any degree of thoroughness is
impossible, and simplifying assumptions must be made.
Query bloat happens in two stages: first, scanning the inverted index
posting lists takes longer because the posting list of the more general atype
g
R is longer than the posting list of the query atype a ; and second,
 
 
Search WWH ::




Custom Search