Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications - page 264


Corpus/index        Size (GB)
Original corpus     5.72
Gzipped corpus      1.33
Stem index          0.91
Full atype index    4.30

FIGURE 10.17: Relative sizes of the corpus and various indexes for TREC 2000.

exists in the frequency of atypes, and whether we can exploit said skew to avoid indexing a large fraction of types that appear in the type hierarchy. In our earlier example of token CEO appearing in a document, we may choose to index only a few of its hypernym ancestors, say, executive#n#1, administrator#n#1 and person#n#1, because the query log has few or no occurrences of atype causal_agent#n#1. The frequency counts in Figure 10.18 seem to corroborate that there is, indeed, a great deal of skew in query atypes.
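The selection rule in the CEO example can be sketched as a simple filter over a token's hypernym chain. This is only an illustration under stated assumptions: the function name `select_atypes_to_index`, the `min_count` threshold, the chain itself, and every count except person#n#1 (77, from Figure 10.18) are hypothetical and not taken from the text.

```python
from collections import Counter


def select_atypes_to_index(hypernym_chain, query_atype_counts, min_count=1):
    """Keep only the ancestors seen at least min_count times in the query log."""
    return [
        atype
        for atype in hypernym_chain
        if query_atype_counts.get(atype, 0) >= min_count
    ]


# Illustrative (hypothetical) hypernym chain for the token "CEO";
# the real WordNet path contains more intermediate synsets.
ceo_chain = [
    "executive#n#1",
    "administrator#n#1",
    "person#n#1",
    "causal_agent#n#1",
]

# Hypothetical query-log counts; only person#n#1 = 77 comes from Figure 10.18.
query_atype_counts = Counter(
    {"executive#n#1": 2, "administrator#n#1": 1, "person#n#1": 77}
)

indexed = select_atypes_to_index(ceo_chain, query_atype_counts)
print(indexed)
```

With these made-up counts, causal_agent#n#1 never occurs in the log and is therefore left out of the index, while the other three ancestors are kept.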

Freq  Query atype         Freq  Query atype
100   integer#n#1         5     president#n#2
78    location#n#1        5     inventor#n#1
77    person#n#1          4     astronaut#n#1
20    city#n#1            4     creator#n#2
10    name#n#1            4     food#n#1
7     author#n#1          4     mountain#n#1
7     company#n#1         4     musical_instrument#n#1
6     actor#n#1           4     newspaper#n#1
6     date#n#1            4     sweetener#n#1
6     number#n#1          4     time_period#n#1
6     state#n#2           4     word#n#1
5     monarch#n#1         3     state#n#1
5     movie#n#1           3     university#n#1

FIGURE 10.18: Highly skewed atype frequencies in TREC query logs.
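One rough way to quantify the skew is to compute the share of queries covered by the most frequent atypes. The sketch below uses only the counts listed in Figure 10.18, so the shares are relative to the atypes shown there, not to the whole query log; the helper name `top_k_share` is an assumption made for illustration.

```python
# Counts transcribed from Figure 10.18.
figure_10_18 = {
    "integer#n#1": 100, "location#n#1": 78, "person#n#1": 77,
    "city#n#1": 20, "name#n#1": 10, "author#n#1": 7, "company#n#1": 7,
    "actor#n#1": 6, "date#n#1": 6, "number#n#1": 6, "state#n#2": 6,
    "monarch#n#1": 5, "movie#n#1": 5, "president#n#2": 5,
    "inventor#n#1": 5, "astronaut#n#1": 4, "creator#n#2": 4,
    "food#n#1": 4, "mountain#n#1": 4, "musical_instrument#n#1": 4,
    "newspaper#n#1": 4, "sweetener#n#1": 4, "time_period#n#1": 4,
    "word#n#1": 4, "state#n#1": 3, "university#n#1": 3,
}


def top_k_share(counts, k):
    """Fraction of the listed query mass covered by the k most frequent atypes."""
    ordered = sorted(counts.values(), reverse=True)
    return sum(ordered[:k]) / sum(ordered)


total = sum(figure_10_18.values())
top3_share = top_k_share(figure_10_18, 3)
print(f"listed queries: {total}, top-3 share: {top3_share:.2f}")
```

Among the listed atypes, the top three (integer#n#1, location#n#1, person#n#1) account for 255 of 385 occurrences, roughly two-thirds.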

However, as is well appreciated in the information retrieval, language modeling and Web search communities, the distribution of query atype frequencies is actually heavy-tailed, meaning that a substantial probability mass is occupied by rare atypes (unlike, say, in an exponential tail). This means that, even if we “train” our system over large query logs, we will always be surprised in subsequent deployment by atypes we never saw in the training set, and this will happen often enough to damage our aggregate performance.
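To get a feel for how often unseen atypes might turn up, one standard tool from language modeling is the Good-Turing estimate of missing mass: the number of atypes seen exactly once, divided by the total number of training queries. This is not a method taken from the text, just a common estimator sketched here under that assumption; the function name and the toy log are hypothetical.

```python
from collections import Counter


def unseen_mass_estimate(training_atypes):
    """Good-Turing estimate of the probability that the next query atype
    was never seen in training: (# atypes seen exactly once) / (# queries)."""
    counts = Counter(training_atypes)
    n_total = sum(counts.values())
    if n_total == 0:
        return 1.0  # nothing observed yet, so everything is unseen
    n_once = sum(1 for c in counts.values() if c == 1)
    return n_once / n_total


# Hypothetical toy training log (not real TREC data).
toy_log = (
    ["person#n#1"] * 5
    + ["location#n#1"] * 3
    + ["city#n#1", "sweetener#n#1"]
)

estimate = unseen_mass_estimate(toy_log)
print(estimate)
```

In a heavy-tailed distribution the number of singleton atypes stays large as the log grows, so this estimate shrinks only slowly, which matches the point that more training data does not eliminate surprises.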
