Counting and Data Sampling in Physical Design Exploration - Physical Database Design

Databases Reference

In-Depth Information

While the Hybrid estimator clearly provides superior estimates, the First-order

Jackknife estimator does a relatively good job while being much easier to compute. Also,

although the First-order Jackknife estimator has worse average-case behavior than the

Hybrid estimator, experiments in Haas [1998] show that the former estimator is more

robust in the presence of data skew, in the sense that it has better worst-case perfor-

mance. The First-order Jackknife can therefore be a useful tool for database design prob-

lems, provided the database designer keeps in mind that this estimator often has up to

double-digit RMSE, and can be highly error prone at low sampling rates.

Table 10.2

Qualitative Comparison of the Hybrid and First-order Jackknife Estimators

One of the places where sampling can be very powerful in database design is in the

area of materialized-view size estimation where, as in our previous example, the view

definition does not have any aggregation functions (such as SUM or COUNT) in the

SELECT clause. In this case, estimation of the number of distinct values is not

required. For a materialized view that is defined on a single base table, the size can usu-

ally be predicted with good accuracy using sampling, by multiplying the number of

rows returned from the sample by the inverse of the sampling rate. When the material-

ized view comprises a select-join query on two or more base tables, the size can again be

predicted by multiplying the number of rows in the sample by the inverse of the “effec-

tive” sampling rate, where the effective sampling rate is defined as the product of the

sampling rates over all sampled tables [Haas 1996]. The difficulty here is that the accu-

racy of the estimate is very sensitive to the choice of sampling rates, and it is very hard to

determine a good allocation of the overall sampling effort a priori. The most effective

sampling techniques learn the optimal sampling rates as they go and adapt their behav-

ior accordingly [Haas 1996, Haas 1999]. Current database systems do not yet support

such complex sampling schemes.

The situation becomes even more complicated when the view definition starts to

invoke the full power of SQL, with correlated subqueries and other advanced features.

View-size estimation in these complex scenarios remains an open problem. Fortunately,

there exists a very important and common multitable scenario where sampling can be

applied in a straightforward way. An example is given by the EMP table defined previ-

ously. In this case, the Department table is most likely a small table with a short list of

departments and the bulk of the data is coming from the Employee table. Because the

Search WWH ::

Custom Search

Home