Corpus/index        Size (GB)
Original corpus     5.72
Gzipped corpus      1.33
Stem index          0.91
Full atype index    4.30

FIGURE 10.17: Relative sizes of the corpus and various indexes for TREC 2000.
exists in the frequency of atypes, and whether we can exploit said skew to
avoid indexing a large fraction of types that appear in the type hierarchy.
In our earlier example of token CEO appearing in a document, we may
choose to index only a few of its hypernym ancestors, say, executive#n#1,
administrator#n#1 and person#n#1, because the query log has few or
no occurrences of atype causal_agent#n#1 . The frequency counts in
Figure 10.18 seem to corroborate that there is, indeed, a great deal of skew
in query atypes.
Freq  Query atype      Freq  Query atype
100   integer#n#1      5     president#n#2
78    location#n#1     5     inventor#n#1
77    person#n#1       4     astronaut#n#1
20    city#n#1         4     creator#n#2
10    name#n#1         4     food#n#1
7     author#n#1       4     mountain#n#1
7     company#n#1      4     musical_instrument#n#1
6     actor#n#1        4     newspaper#n#1
6     date#n#1         4     sweetener#n#1
6     number#n#1       4     time_period#n#1
6     state#n#2        4     word#n#1
5     monarch#n#1      3     state#n#1
5     movie#n#1        3     university#n#1

FIGURE 10.18: Highly skewed atype frequencies in TREC query logs.
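The selective indexing idea from the CEO example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual indexing code: the ancestor chain is abbreviated, the function name `atypes_to_index` and the `min_count` threshold are hypothetical, and only the count for person#n#1 comes from Figure 10.18 (the other counts are placeholders).

```python
from collections import Counter

# Abbreviated hypernym chain for the token "CEO", most specific first.
# Illustrative only; not a complete WordNet path.
CEO_ANCESTORS = [
    "executive#n#1",
    "administrator#n#1",
    "person#n#1",
    "causal_agent#n#1",
]

# Query-log atype counts. person#n#1 = 77 is from Figure 10.18;
# the executive and administrator counts are made-up placeholders.
QUERY_ATYPE_FREQ = Counter({
    "person#n#1": 77,
    "executive#n#1": 2,
    "administrator#n#1": 1,
})


def atypes_to_index(ancestors, query_freq, min_count=1):
    """Keep only ancestors seen at least min_count times in the query log.

    A Counter returns 0 for missing keys, so atypes that never occur
    in the log are dropped.
    """
    return [a for a in ancestors if query_freq[a] >= min_count]


selected = atypes_to_index(CEO_ANCESTORS, QUERY_ATYPE_FREQ)
print(selected)
```

With these placeholder counts, causal_agent#n#1 is left out of the index because it never occurs in the log. Raising `min_count` prunes more aggressively, at the cost of being unable to answer queries on the dropped atypes directly from the index.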
However, as is well appreciated in the information retrieval, language
modeling and Web search communities, the distribution of query atype
frequencies is actually heavy-tailed, meaning that a substantial probability
mass is occupied by rare atypes (unlike, say, in an exponential tail). This
means that, even if we “train” our system over large query logs, we will always
be surprised in subsequent deployment by atypes we never saw in the training
set, and this will happen often enough to damage our aggregate performance.
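One rough way to gauge how often unseen atypes will turn up is the Good-Turing estimate of the unseen probability mass: the number of atypes seen exactly once, divided by the total number of logged queries. The sketch below uses that estimator as an illustration; the text itself does not prescribe it. The three head counts are from Figure 10.18, while the 45 singleton atypes are invented purely to give the toy log a tail.

```python
from collections import Counter


def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the probability that the next query
    has an atype never seen in the log: (#atypes seen once) / (#queries)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("query log is empty")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


def rare_mass(counts, max_count):
    """Fraction of logged queries whose atype occurs at most max_count times."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("query log is empty")
    return sum(c for c in counts.values() if c <= max_count) / total


# Head of the distribution: counts taken from Figure 10.18.
toy_log = Counter({
    "integer#n#1": 100,
    "location#n#1": 78,
    "person#n#1": 77,
})

# Hypothetical tail: 45 invented atypes, each seen once.
for i in range(45):
    toy_log[f"rare_atype_{i}#n#1"] = 1

unseen = good_turing_unseen_mass(toy_log)
print(f"estimated unseen mass: {unseen:.2f}")
```

In this toy log, 45 of the 300 queries are singletons, so the estimate is 0.15: roughly one query in seven would be expected to carry an atype absent from the training log. The figure depends entirely on the invented tail and says nothing about the real TREC logs.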
 