Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

Therefore, our first task, in Section 10.4.1, will be to turn raw atype

frequencies from the query log into a smoother distribution over atypes.

Second, in Section 10.4.2 we will formalize our strategy of indexing only a

suitably-chosen subset of atypes; in particular, how to adapt to missing atypes

at query time. Having fixed the query execution template, we will engage in

two modeling tasks: estimating the space saved by indexing only a subset

of atypes (Section 10.4.3) and estimating the query time blow-up because

all atypes were not indexed (Section 10.4.4). Armed with these models, in

Section 10.4.5, we will propose a simple but effective algorithm to choose the

atype subset to index. Finally, we will describe experimental performance in

Section 10.4.6.

10.4.1 Probability of a Query Atype

The atype subset selection algorithm we propose uses an estimate of the

probability of seeing an atype a in a new query, queryProb ( a ). For WordNet

alone, a can have over 18,000 (non-leaf) values, and the skew makes it dicult

to estimate the probabilities of rare atypes.

This is a standard issue in language modeling (29). The solution is to reserve

and distribute a tiny bit of probability over all atypes not seen in training data.

We use the well-known Lidstone smoothing formula to implement this:

queryCount ( a )+

a queryCount ( a )+

queryProb ( a )=

(10.7)

where 0 <≤ 1 is a parameter to be set via cross-validation. Several times, we

randomly split the workload into halves W 1 and W 2 ,estimate queryProb ( a )

using W 1 and estimate the probability of W 2 as

queryCount W 2 ( a )log queryProb W 1 ( a )

a∈W 2

Results are shown in Figure 10.19 ; it is fairly easy to pick off a prominently

best for a given dataset. We shall see later in Section 10.4.6 that has quite

a strong effect on the quality of our index and the query performance.

10.4.2 Pre-Generalize and Post-Filter

Letthefullsetofatypesbe A and imagine that some subset R

A is

registered . During indexing, tokens are attached to the type taxonomy

(see Section 10.3.3.1 ) and we walk up is-a links, only registered atypes are

included in the index. For example, in Figure 10.20 , the heavily-shaded nodes

entity#n#1 and living_thing#n#1 are in R , but the lightly-shaded nodes

are not.

Now suppose we are given an atype index for only the atypes in R ,andgeta

query with atype a

⊆

∈

R . For example, corresponding to the natural language

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home