Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

will consider the benefit of query time saved.

An exact estimate of inverted index size is dicult in the face of index

compression techniques (39). The posting list for an atype a (or a token in

general) has corpusCount ( a ) entries in it, so as a first approximation, it takes

space proportional to corpusCount ( a ). Therefore, if subset R is indexed, the

space needed can be approximated as

a∈R corpusCount ( a ) .

(10.8)

5.E+09

4.E+09

3.E+09

2.E+09

1.E+09

0.E+00

0.0E+00 5.0E+08 1.0E+09 1.5E+09 2.0E+09 2.5E+09

Estimated Index Size

FIGURE 10.22 : a∈R corpusCount ( a ) is a very good predictor of the size

of the atype subset index. (Root atypes are not indexed.)

Figure 10.22 shows that this crude approximation is surprisingly accurate.

This is probably because, averaged over many atypes, index compression

affects disk space by a fairly stable and uniform factor.

10.4.4 Query Time Bloat Model

Next we turn to the considerably more dicult task of estimating the factor

by which query execution slows down because only R ,not A ,thesetofall

atypes, has been indexed. This is dicult because, in general, the estimate

will depend on co-occurrence statistics between all possible atypes and all

possible words. In traditional relational database query optimization, where

the number of tables and attributes is modest, estimating multidimensional

“selectivity” of select and join predicates is a challenging problem (20).

Our testbed has over a million distinct tokens and some 18000 atypes

in A . Therefore, capturing correlations with any degree of thoroughness is

impossible, and simplifying assumptions must be made.

Query bloat happens in two stages: first, scanning the inverted index

posting lists takes longer because the posting list of the more general atype

g

∈

R is longer than the posting list of the query atype a ; and second,

Search WWH ::

Custom Search

Home