Database Reference
In-Depth Information
the sample. In general, a uniform random sample of 1 to 10% can effec-
tively require reading all pages of a given table. For that reason, rather
than performing a uniform random sample of tuples in a table, we can
randomly sample pages in disk and use all tuples in the page as part
of the sample. Of course, this mechanism works whenever there is no
correlation between the values of the column and the page the tuple be-
longs to, which is not always the case. A common approach to remedy
this problem is to use an adaptive page-l ev el sampling algorithm. The
idea is to start with a “seed” sample of P pages, where P is t he num-
ber of pages in the table. We then retrieve a new sample of P pages
for cross-validation purposes. If it is determined that both samples are
not statistically similar, we combine both samples into one and repeat
the procedure of obtaining a new sample. When the cross-validation
step is successful, the accumulated sample is used to obtain the final
histogram. Each hypothetical index creation thus requires that we sam-
ple the corresponding table and build a histogram on the index key
column.
Decoupling of statistics and indexes. While using sampling to create
histograms significantly improves performance, we still need to obtain
a random sample for each index that is created. Alternatively, we can
decouple statistics and the corresponding indexes and build statistics
only once for each column. In this case, histograms would be shared
among any index that shares the same leading key column. Additionally,
if some histogram already exists in the DBMS (due to an existing index,
or because the histogram itself was manually generated) we do not need
to duplicate it and can reuse the existing histogram instead.
Sharing multicolumn statistics. If the DBMS supports multicolumn
statistics, the techniques discussed previously need to be extended. The
reason is that different indexes with the same leading column would be
associated with different statistics that cannot be shared. Depending on
the properties of such multicolumn statistics, however, we might be able
to improve the basic scheme. For instance, some multicolumn statistics
(e.g., multidimensional histograms) satisfy some notion of “commutativ-
ity” (i.e., statistics on (
) are essentially the same). Other
commonly used statistics (e.g., a single-column histogram plus the num-
ber of distinct values for each prefix of columns in a given set) satisfy
a “containment” property (i.e.,
a
,
b
) and (
b
,
a
(
a
,
b
,
c
)
contains all the information of
(
). These properties allow us to reduce the set of statistics
to materialize without significantly compromising quality. As an exam-
ple, suppose that we want to consider indexes on
a
)
and
(
a
,
b
)
(
a
)
,
(
b
)
,
(
a
,
b
)
,
(
b
,
a
)
,
and
. In principle, we would need to create all five multicolumn
statistics. However, if statistics satisfy the containment property, we can
obtain the same result by materializing only
(
a
,
b
,
c
)
. Obtain-
ing the minimal set of statistics to create is a challenging problem, and
(
a
,
b
,
c
)
and
(
b
,
a
)
Search WWH ::




Custom Search