Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

descendant h might see a reduction in bloat.

If h 's bloat decreases, all

ancestors u of h must update their δB/δS scores.

There is a subtle asymmetry in how the code is set up. Here we begin

with R =

and grow R . We cannot, for instance, begin with R = A

and discard unworthy atypes with the smallest δB/δS . Initially, all specific

atypes will be in R , and more general atypes will appear completely valueless.

AtypeSubsetChooser will steadily discard any atype that is not directly

in the query log. Eventually, when the log is completely processed, we will

be cornered into choosing a subset of atypes that directly appear in the log.

Therefore, we will not be able to get any benefit out of choosing generalizations

that are nearby confluences of many atypes in the log.

∅

10.4.6

Experiments

10.4.6.1

Estimated space-time tradeoff

Figure 10.28 (upper chart) shows the reduction in estimated maximum

bloat over all queries as AtypeSubsetChooser grows R .Eachcurveisfora

different Lidstone parameter . The estimated average bloat over all queries

would be overly influenced by a few outliers (see Figure 10.26 ). Therefore we

discard the lowest and highest 2% of bloats and show a robust average over

the rest (lower chart).

The curves in Figure 10.28 show a prominent knee: by the time the

(estimated) index size is allowed to grow to 145 MB, the robust average bloat

is 7, and it drops to 2 with an estimated index size of only 300 MB ( =10 − 3 ).

Very low results in low queryProb for atypes not seen in the training set,

leading to an excessively aggressive discarding of atypes and consequently high

test-set bloats. As is increased, queryProb increases, forcing AtypeSubset-

Chooser to conservatively include more atypes not seen in the training set.

It is comforting to see in Figure 10.29 that the best trade-off happens

for roughly the same value of that provided the largest cross-validated

log-likelihood in Figure 10.19 . This need not have happened: maximizing

workload likelihood is not the same as reducing query bloat.

10.4.6.2

Observed space-time trade-off

Next we ran multiple queries with various R s having different index sizes

to find actual running times and, hence, actual bloats ( Figure 10.30 ). The

average observed bloat curve follows the estimated bloat curve in Figure 10.28

quite closely. In fact, averaged over many queries, our simple bloat prediction

model does even better than at a per-query level (see Figure 10.25 ). With a

modest 515 MB atype subset index, the average bloat is brought down to only

1 . 85.

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home