Database Reference
In-Depth Information
This optimization problem can be shown to be NP hard via a reduction from
the knapsack problem, even when the type hierarchy is a tree. Therefore we
look for practical heuristics. We adopt a greedy approach of starting
R
with
only the roots of
A
4
and progressively adding the locally “most profitable”
atype
c
. Here “profit” depends inversely on the additional space
δS
that
will be required by the posting list of
c
, and directly on the reduction
δB
of
expected bloat that will result from including
c
in
R
. Weusetheratio
δB/δS
to pick the best
c
at every step.
AtypeSubsetChooser
(
A, W
)
1:
R
←{
roots of
A
}
, candidates
C
←
A
\
R
←
r∈R
corpusCount
(
r
)
3:
using equations (10.7) and (10.10), expected bloat
B
2:
initial estimated space
S
←
a∈R∪C
queryProb
W
(
a
)
queryBloat
(
a, R
)
4:
UpdateBloatsAndScores
(
C,
commit=false)
5:
while
R
is small and/or
B
is large
do
6:
∀
c
∈
C
with the largest
score
(
c
)
7:
UpdateBloatsAndScores
(
c,
commit=true)
UpdateBloatsAndScores
(
a,
commitFlag)
1:
B
←
choose
c
∈
S
+
corpusCount
(
a
)
2:
“cousins” of
a
to be patched
U
B
,
S
←
←
∅
3:
for
each
h
∈
R, h
∈
C, h
IsA
a
do
b
=
queryBloat
(
h, R
),
b
=
queryBloat
(
h, R
∪
a
)
4:
if
b
<b
(bloat reduces)
then
5:
B
←
(
b
−
b
)
queryProb
W
(
h
)
6:
if
commitFlag
then
7:
U
←
U
∪{
g
:
g
∈
C, g
=
a, h
IsA
g
}
8:
B
)
/
(
S
−
9:
score
(
a
)
←
(
B
−
S
)
10:
if
commitFlag
then
11:
move
a
from
C
to
R
S
,
B
B
S
←
←
12:
UpdateBloatsAndScores
(
∀
u
∈
U,
commit=false)
13:
FIGURE 10.27
: The inputs are atype set
A
and workload
W
. The output is
a series of trade-offs between index size of
R
and average query bloat over
W
.
Once
c
is included, each
The pseudocode is shown in Figure 10.27.
4
Including the roots is only notional. Root types are so frequent in a typical corpus that
if generalization takes us to a root type it basically means we must scan the corpus end to
end. Therefore, any reasonable
R
will prevent this event.
Search WWH ::
Custom Search