Database Reference
In-Depth Information
is that each cell in an MDC (which is composed of consecutive pages in disk)
is bound to have some amount of free space in its last page. In the worst
case, if each tuple in a table requires its own cell, the MDC would require one
page per tuple, which can be much larger than the space required by the orig-
inal table. For that reason, a common problem formulation requires that each
recommended MDC satisfy a storage constraint. Specifically, it is common to
ask that any valid MDC has at most 10% space overhead over the original
table. Given this problem formulation, we next briefly discuss an automated
technique to recommend a good MDC configuration for a given workload:
1. Approximate the best-case benefit for each candidate dimension column,
which involves:
a. Determine columns of interest, including those that are used for
multidimensional predicates, grouping, and ordering clauses.
b. Determine the minimum granularity for each interesting column
that would result in at most 10% storage overhead. If we assume
that there are k cells in the MDC and that the last page in each
cell is on average P % full (50% is the expected value, but we can
use a more conservative percentage like 65%), then the storage
overhead is k
pageSize . This formula can be used to bound
the maximum number of cells ( k max ) in an MDC for a given storage
constraint. In turn, k max is also the maximum number of distinct
values that we can consider for any dimension column (because the
number of distinct values of a combination of columns is no smaller
than the number of distinct values of each column in isolation). We
can then determine the computed value for a dimension column
as its position in an equi-width histogram with k max buckets that
covers the column domain. That is, if column a has values ranging
from a min to a max , we define the computed column a
P %
= (
a min )/(
a max
a min )
k max (if a has fewer than k max distinct values,
then a
a ). This constitutes the finest level of granularity for
column a in the MDC that satisfies the storage constraint.
c. Obtain a benefit value for each dimension column. We approximate
the best-case benefit for a given dimension column by optimizing
each query in the workload assuming the finest granularity ob-
tained as already explained and also by assuming that every tuple
in the dimension column maps to a unique cell (which should be
the worst-case granularity for the column). The benefit is then
calculated as relative reduction in query costs between these two
2. Model the benefit of candidate dimensions for varying granularities. So
far we have obtained the benefit of each query in the workload under the
finest and coarsest granularities. In practice we might want to cluster our
data at some intermediate granularity to arrive at a better configuration.
Search WWH ::

Custom Search