Databases Reference
In-Depth Information
While the Hybrid estimator clearly provides superior estimates, the First-order
Jackknife estimator does a relatively good job while being much easier to compute. Also,
although the First-order Jackknife estimator has worse average-case behavior than the
Hybrid estimator, experiments in Haas [1998] show that the former estimator is more
robust in the presence of data skew, in the sense that it has better worst-case perfor-
mance. The First-order Jackknife can therefore be a useful tool for database design prob-
lems, provided the database designer keeps in mind that this estimator often has up to
double-digit RMSE, and can be highly error prone at low sampling rates.
Table 10.2
Qualitative Comparison of the Hybrid and First-order Jackknife Estimators
One of the places where sampling can be very powerful in database design is in the
area of materialized-view size estimation where, as in our previous example, the view
definition does not have any aggregation functions (such as SUM or COUNT) in the
SELECT clause. In this case, estimation of the number of distinct values is not
required. For a materialized view that is defined on a single base table, the size can usu-
ally be predicted with good accuracy using sampling, by multiplying the number of
rows returned from the sample by the inverse of the sampling rate. When the material-
ized view comprises a select-join query on two or more base tables, the size can again be
predicted by multiplying the number of rows in the sample by the inverse of the “effec-
tive” sampling rate, where the effective sampling rate is defined as the product of the
sampling rates over all sampled tables [Haas 1996]. The difficulty here is that the accu-
racy of the estimate is very sensitive to the choice of sampling rates, and it is very hard to
determine a good allocation of the overall sampling effort a priori. The most effective
sampling techniques learn the optimal sampling rates as they go and adapt their behav-
ior accordingly [Haas 1996, Haas 1999]. Current database systems do not yet support
such complex sampling schemes.
The situation becomes even more complicated when the view definition starts to
invoke the full power of SQL, with correlated subqueries and other advanced features.
View-size estimation in these complex scenarios remains an open problem. Fortunately,
there exists a very important and common multitable scenario where sampling can be
applied in a straightforward way. An example is given by the EMP table defined previ-
ously. In this case, the Department table is most likely a small table with a short list of
departments and the bulk of the data is coming from the Employee table. Because the
Search WWH ::




Custom Search