Database Reference
In-Depth Information
descendant h might see a reduction in bloat.
If h 's bloat decreases, all
ancestors u of h must update their δB/δS scores.
There is a subtle asymmetry in how the code is set up. Here we begin
with R =
and grow R . We cannot, for instance, begin with R = A
and discard unworthy atypes with the smallest δB/δS . Initially, all specific
atypes will be in R , and more general atypes will appear completely valueless.
AtypeSubsetChooser will steadily discard any atype that is not directly
in the query log. Eventually, when the log is completely processed, we will
be cornered into choosing a subset of atypes that directly appear in the log.
Therefore, we will not be able to get any benefit out of choosing generalizations
that are nearby confluences of many atypes in the log.
10.4.6
Experiments
10.4.6.1
Estimated space-time tradeoff
Figure 10.28 (upper chart) shows the reduction in estimated maximum
bloat over all queries as AtypeSubsetChooser grows R .Eachcurveisfora
different Lidstone parameter . The estimated average bloat over all queries
would be overly influenced by a few outliers (see Figure 10.26 ). Therefore we
discard the lowest and highest 2% of bloats and show a robust average over
the rest (lower chart).
The curves in Figure 10.28 show a prominent knee: by the time the
(estimated) index size is allowed to grow to 145 MB, the robust average bloat
is 7, and it drops to 2 with an estimated index size of only 300 MB ( =10 3 ).
Very low results in low queryProb for atypes not seen in the training set,
leading to an excessively aggressive discarding of atypes and consequently high
test-set bloats. As is increased, queryProb increases, forcing AtypeSubset-
Chooser to conservatively include more atypes not seen in the training set.
It is comforting to see in Figure 10.29 that the best trade-off happens
for roughly the same value of that provided the largest cross-validated
log-likelihood in Figure 10.19 . This need not have happened: maximizing
workload likelihood is not the same as reducing query bloat.
10.4.6.2
Observed space-time trade-off
Next we ran multiple queries with various R s having different index sizes
to find actual running times and, hence, actual bloats ( Figure 10.30 ). The
average observed bloat curve follows the estimated bloat curve in Figure 10.28
quite closely. In fact, averaged over many queries, our simple bloat prediction
model does even better than at a per-query level (see Figure 10.25 ). With a
modest 515 MB atype subset index, the average bloat is brought down to only
1 . 85.
 
Search WWH ::




Custom Search