Database Reference
In-Depth Information
Therefore, our first task, in Section 10.4.1, will be to turn raw atype
frequencies from the query log into a smoother distribution over atypes.
Second, in Section 10.4.2 we will formalize our strategy of indexing only a
suitably-chosen subset of atypes; in particular, how to adapt to missing atypes
at query time. Having fixed the query execution template, we will engage in
two modeling tasks: estimating the space saved by indexing only a subset
of atypes (Section 10.4.3) and estimating the query time blow-up because
all atypes were not indexed (Section 10.4.4). Armed with these models, in
Section 10.4.5, we will propose a simple but effective algorithm to choose the
atype subset to index. Finally, we will describe experimental performance in
Section 10.4.6.
10.4.1 Probability of a Query Atype
The atype subset selection algorithm we propose uses an estimate of the
probability of seeing an atype a in a new query, queryProb ( a ). For WordNet
alone, a can have over 18,000 (non-leaf) values, and the skew makes it dicult
to estimate the probabilities of rare atypes.
This is a standard issue in language modeling (29). The solution is to reserve
and distribute a tiny bit of probability over all atypes not seen in training data.
We use the well-known Lidstone smoothing formula to implement this:
queryCount ( a )+
a queryCount ( a )+
queryProb ( a )=
(10.7)
where 0 <≤ 1 is a parameter to be set via cross-validation. Several times, we
randomly split the workload into halves W 1 and W 2 ,estimate queryProb ( a )
using W 1 and estimate the probability of W 2 as
queryCount W 2 ( a )log queryProb W 1 ( a )
a∈W 2
Results are shown in Figure 10.19 ; it is fairly easy to pick off a prominently
best for a given dataset. We shall see later in Section 10.4.6 that has quite
a strong effect on the quality of our index and the query performance.
10.4.2 Pre-Generalize and Post-Filter
Letthefullsetofatypesbe A and imagine that some subset R
A is
registered . During indexing, tokens are attached to the type taxonomy
(see Section 10.3.3.1 ) and we walk up is-a links, only registered atypes are
included in the index. For example, in Figure 10.20 , the heavily-shaded nodes
entity#n#1 and living_thing#n#1 are in R , but the lightly-shaded nodes
are not.
Now suppose we are given an atype index for only the atypes in R ,andgeta
query with atype a
R . For example, corresponding to the natural language
Search WWH ::




Custom Search