Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

This optimization problem can be shown to be NP hard via a reduction from

the knapsack problem, even when the type hierarchy is a tree. Therefore we

look for practical heuristics. We adopt a greedy approach of starting R with

only the roots of A 4 and progressively adding the locally “most profitable”

atype c . Here “profit” depends inversely on the additional space δS that

will be required by the posting list of c , and directly on the reduction δB of

expected bloat that will result from including c in R . Weusetheratio δB/δS

to pick the best c at every step.

AtypeSubsetChooser ( A, W )

1: R

←{

roots of A

}

, candidates C

←

← r∈R corpusCount ( r )

3: using equations (10.7) and (10.10), expected bloat

2: initial estimated space S

← a∈R∪C queryProb W ( a ) queryBloat ( a, R )

4: UpdateBloatsAndScores (

C, commit=false)

5: while R is small and/or B is large do

∀

∈

C with the largest score ( c )

7: UpdateBloatsAndScores ( c, commit=true)

UpdateBloatsAndScores ( a, commitFlag)

1: B ←

choose c

∈

S + corpusCount ( a )

2: “cousins” of a to be patched U

B , S ←

← ∅

3: for each h

∈

R, h

∈

C, h IsA a do

b = queryBloat ( h, R ), b = queryBloat ( h, R

∪

a )

if b <b (bloat reduces) then

B ←

( b −

b ) queryProb W ( h )

if commitFlag then

←

∪{

g : g

∈

C, g

= a, h IsA g

}

B ) / ( S −

9: score ( a )

←

( B

−

S )

10: if commitFlag then

11:

move a from C to R

S , B

←

12:

UpdateBloatsAndScores (

∀

∈

U, commit=false)

13:

FIGURE 10.27 : The inputs are atype set A and workload W . The output is

a series of trade-offs between index size of R and average query bloat over W .

Once c is included, each

The pseudocode is shown in Figure 10.27.

4 Including the roots is only notional. Root types are so frequent in a typical corpus that

if generalization takes us to a root type it basically means we must scan the corpus end to

end. Therefore, any reasonable R will prevent this event.

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home