Database Reference

In-Depth Information

is that each cell in an MDC (which is composed of consecutive pages in disk)

is bound to have some amount of free space in its last page. In the worst

case, if each tuple in a table requires its own cell, the MDC would require one

page per tuple, which can be much larger than the space required by the orig-

inal table. For that reason, a common problem formulation requires that each

recommended MDC satisfy a storage constraint. Specifically, it is common to

ask that any valid MDC has at most 10% space overhead over the original

table. Given this problem formulation, we next briefly discuss an automated

technique to recommend a good MDC configuration for a given workload:

1. Approximate the best-case benefit for each candidate dimension column,

which involves:

a. Determine columns of interest, including those that are used for

multidimensional predicates, grouping, and ordering clauses.

b. Determine the minimum granularity for each interesting column

that would result in at most 10% storage overhead. If we assume

that there are
k
cells in the MDC and that the last page in each

cell is on average
P
% full (50% is the expected value, but we can

use a more conservative percentage like 65%), then the storage

overhead is
k

pageSize
. This formula can be used to bound

the maximum number of cells (
k
max
) in an MDC for a given storage

constraint. In turn,
k
max
is also the maximum number of distinct

values that we can consider for any dimension column (because the

number of distinct values of a combination of columns is no smaller

than the number of distinct values of each column in isolation). We

can then determine the computed value for a dimension column

as its position in an equi-width histogram with
k
max
buckets that

covers the column domain. That is, if column
a
has values ranging

from
a
min
to
a
max
, we define the computed column
a

·

P
%

·

=
(

a

−

a
min
)/(

a
max
−

a
min
)
∗

k
max
(if
a
has fewer than
k
max
distinct values,

then
a

a
). This constitutes the finest level of granularity for

column
a
in the MDC that satisfies the storage constraint.

c. Obtain a
benefit
value for each dimension column. We approximate

the best-case benefit for a given dimension column by optimizing

each query in the workload assuming the finest granularity ob-

tained as already explained and also by assuming that every tuple

in the dimension column maps to a unique cell (which should be

the worst-case granularity for the column). The benefit is then

calculated as relative reduction in query costs between these two

extremes.

2. Model the benefit of candidate dimensions for varying granularities. So

far we have obtained the benefit of each query in the workload under the

finest and coarsest granularities. In practice we might want to cluster our

data at some intermediate granularity to arrive at a better configuration.

=

Search WWH ::

Custom Search