Database Reference
In-Depth Information
This optimization problem can be shown to be NP hard via a reduction from
the knapsack problem, even when the type hierarchy is a tree. Therefore we
look for practical heuristics. We adopt a greedy approach of starting R with
only the roots of A 4 and progressively adding the locally “most profitable”
atype c . Here “profit” depends inversely on the additional space δS that
will be required by the posting list of c , and directly on the reduction δB of
expected bloat that will result from including c in R . Weusetheratio δB/δS
to pick the best c at every step.
AtypeSubsetChooser ( A, W )
1: R
←{
roots of A
}
, candidates C
A
\
R
r∈R corpusCount ( r )
3: using equations (10.7) and (10.10), expected bloat
B
2: initial estimated space S
a∈R∪C queryProb W ( a ) queryBloat ( a, R )
4: UpdateBloatsAndScores (
C, commit=false)
5: while R is small and/or B is large do
6:
c
C with the largest score ( c )
7: UpdateBloatsAndScores ( c, commit=true)
UpdateBloatsAndScores ( a, commitFlag)
1: B
choose c
S + corpusCount ( a )
2: “cousins” of a to be patched U
B , S
3: for each h
R, h
C, h IsA a do
b = queryBloat ( h, R ), b = queryBloat ( h, R
a )
4:
if b <b (bloat reduces) then
5:
B
( b
b ) queryProb W ( h )
6:
if commitFlag then
7:
U
U
∪{
g : g
C, g
= a, h IsA g
}
8:
B ) / ( S
9: score ( a )
( B
S )
10: if commitFlag then
11:
move a from C to R
S , B
B
S
12:
UpdateBloatsAndScores (
u
U, commit=false)
13:
FIGURE 10.27 : The inputs are atype set A and workload W . The output is
a series of trade-offs between index size of R and average query bloat over W .
Once c is included, each
The pseudocode is shown in Figure 10.27.
4 Including the roots is only notional. Root types are so frequent in a typical corpus that
if generalization takes us to a root type it basically means we must scan the corpus end to
end. Therefore, any reasonable R will prevent this event.
Search WWH ::




Custom Search