Databases Reference
In-Depth Information
index leaf pages. In the usual case, this requires one I/O per selected RID, and one I/O
per index leaf page.
System sampling is defined in the SQL standard to be a vendor-dependent sam-
pling method. In practice, system sampling is implemented by virtually all vendors as
similar to Bernoulli sampling, except that it samples database data at the level of I/O
units (e.g., storage pages) instead of sampling at the record level. For example, a 5% sys-
tem sample of a table would sample 5% of the storage pages. Sampling pages instead of
records dramatically improved the performance of the sampling process, as shown in
Figure 10.4. A page is included in the sample with a probability of P /100. If a page is
included, all of the rows in that page are included. However, if the data has any signifi-
cant clustering in the table a system sample can lead to seriously erroneous results.
Performance of system (page-level) sampling is excellent because only one I/O is
required for each page that is included in the sample. Compared with no sampling, sys-
tem page-level sampling can improve performance by orders of magnitude.
Figure 10.4
System level sampling in SQL. (Image courtesy of IBM.)
However, the accuracy of aggregate estimates tends to be worse under system page-
level sampling than row-level sampling. This disparity in accuracy is most pronounced
when there are many rows per block or when the columns referenced in the query
exhibit a high degree of clustering within the pages. I/O sampling on a table with signif-
icant clustering (i.e., having a good cluster ratio and therefore storing similar data
together on the same or nearby pages on disk) can lead to dramatic inaccuracies. For
example, Haas [2003] used the following example to show the massive inaccuracies that
can result from using system sampling on clustered data. Haas defined a table with one
million rows, with 100 records per page. He performed experiments to take a 1% sam-
ple on a numeric column in the table using both row-level sampling and system sam-
pling and for each. The sum of the data in the column was then estimated by multiply-
ing the sum of the sample data by 100 (because a 1% sample was used). The estimate
based on applying system sampling to clustered data was an order of magnitude too
large. See Table 10.3 and Haas [2004] for a further discussion of these issues.
Search WWH ::




Custom Search