Designing a Cost Model - Automated Physical Database Design and Tuning

Database Reference

In-Depth Information

the sample. In general, a uniform random sample of 1 to 10% can effec-

tively require reading all pages of a given table. For that reason, rather

than performing a uniform random sample of tuples in a table, we can

randomly sample pages in disk and use all tuples in the page as part

of the sample. Of course, this mechanism works whenever there is no

correlation between the values of the column and the page the tuple be-

longs to, which is not always the case. A common approach to remedy

this problem is to use an adaptive page-l ev el sampling algorithm. The

idea is to start with a “seed” sample of √ P pages, where P is t he num-

ber of pages in the table. We then retrieve a new sample of √ P pages

for cross-validation purposes. If it is determined that both samples are

not statistically similar, we combine both samples into one and repeat

the procedure of obtaining a new sample. When the cross-validation

step is successful, the accumulated sample is used to obtain the final

histogram. Each hypothetical index creation thus requires that we sam-

ple the corresponding table and build a histogram on the index key

column.

Decoupling of statistics and indexes. While using sampling to create

histograms significantly improves performance, we still need to obtain

a random sample for each index that is created. Alternatively, we can

decouple statistics and the corresponding indexes and build statistics

only once for each column. In this case, histograms would be shared

among any index that shares the same leading key column. Additionally,

if some histogram already exists in the DBMS (due to an existing index,

or because the histogram itself was manually generated) we do not need

to duplicate it and can reuse the existing histogram instead.

Sharing multicolumn statistics. If the DBMS supports multicolumn

statistics, the techniques discussed previously need to be extended. The

reason is that different indexes with the same leading column would be

associated with different statistics that cannot be shared. Depending on

the properties of such multicolumn statistics, however, we might be able

to improve the basic scheme. For instance, some multicolumn statistics

(e.g., multidimensional histograms) satisfy some notion of “commutativ-

ity” (i.e., statistics on (

) are essentially the same). Other

commonly used statistics (e.g., a single-column histogram plus the num-

ber of distinct values for each prefix of columns in a given set) satisfy

a “containment” property (i.e.,

) and (

(

)

contains all the information of

(

). These properties allow us to reduce the set of statistics

to materialize without significantly compromising quality. As an exam-

ple, suppose that we want to consider indexes on

)

and

(

)

(

)

(

)

(

)

(

)

and

. In principle, we would need to create all five multicolumn

statistics. However, if statistics satisfy the containment property, we can

obtain the same result by materializing only

(

)

. Obtain-

ing the minimal set of statistics to create is a challenging problem, and

(

)

and

(

)

Automated Physical Database Design and Tuning

Search WWH ::

Custom Search

Home