Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications - page 270

Database Reference

In-Depth Information

because we are now obliged to screen the results using expensive forward

index accesses.

For the first part, we assume that the time spent scanning posting of the

atype a and intersecting them with selector postings takes time proportional

to corpusCount ( a ), independent of what the specific selectors are. This is

confirmed by Figure 10.23.

Histogram of Multimerge Access Time

80

60

40

20

0

0

0.05

0.1

0.15

0.2

0.25

Average Time (msec)

Histogram of Forward Index Access Time

60

45

30

15

0

0

2

4

6

8

Average Time (msec)

FIGURE 10.23 : t scan is suciently concentrated that replacing the

distribution by a constant number is not grossly inaccurate.

The second part depends on the average time t forward it takes to probe the

forward index for one document and do the reachability test, and on k ,the

number of results sought from the pre-generalized query. Like t scan , t forward is

also suciently peaked and centered to use a point estimate ( Figure 10.24 ).

The overall query bloat factor is therefore

t scan corpusCount ( g )+ k t forward

t scan corpusCount ( a )

Now we come to the question of what k should be. If we make the crude

Next Page

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home