Database Reference
In-Depth Information
Therefore, our first task, in Section 10.4.1, will be to turn raw atype
frequencies from the query log into a
smoother
distribution over atypes.
Second, in Section 10.4.2 we will formalize our strategy of indexing only a
suitably-chosen subset of atypes; in particular, how to adapt to missing atypes
at query time. Having fixed the query execution template, we will engage in
two modeling tasks: estimating the space saved by indexing only a subset
of atypes (Section 10.4.3) and estimating the query time blow-up because
all atypes were not indexed (Section 10.4.4). Armed with these models, in
Section 10.4.5, we will propose a simple but effective algorithm to choose the
atype subset to index. Finally, we will describe experimental performance in
Section 10.4.6.
10.4.1 Probability of a Query Atype
The atype subset selection algorithm we propose uses an estimate of the
probability of seeing an atype
a
in a new query,
queryProb
(
a
). For WordNet
alone,
a
can have over 18,000 (non-leaf) values, and the skew makes it dicult
to estimate the probabilities of rare atypes.
This is a standard issue in language modeling (29). The solution is to reserve
and distribute a tiny bit of probability over all atypes not seen in training data.
We use the well-known Lidstone smoothing formula to implement this:
queryCount
(
a
)+
a
queryCount
(
a
)+
queryProb
(
a
)=
(10.7)
where 0
<≤
1 is a parameter to be set via cross-validation. Several times, we
randomly split the workload into halves
W
1
and
W
2
,estimate
queryProb
(
a
)
using
W
1
and estimate the probability of
W
2
as
queryCount
W
2
(
a
)log
queryProb
W
1
(
a
)
a∈W
2
Results are shown in
Figure 10.19
; it is fairly easy to pick off a prominently
a strong effect on the quality of our index and the query performance.
10.4.2 Pre-Generalize and Post-Filter
Letthefullsetofatypesbe
A
and imagine that some subset
R
A
is
registered
. During indexing, tokens are attached to the type taxonomy
(see
Section 10.3.3.1
) and we walk up is-a links, only registered atypes are
included in the index. For example, in
Figure 10.20
, the heavily-shaded nodes
entity#n#1
and
living_thing#n#1
are in
R
, but the lightly-shaded nodes
are not.
Now suppose we are given an atype index for only the atypes in
R
,andgeta
query with atype
a
⊆
∈
R
. For example, corresponding to the natural language
Search WWH ::
Custom Search