Counting and Data Sampling in Physical Design Exploration - Physical Database Design

Databases Reference

In-Depth Information

index leaf pages. In the usual case, this requires one I/O per selected RID, and one I/O

per index leaf page.

System sampling is defined in the SQL standard to be a vendor-dependent sam-

pling method. In practice, system sampling is implemented by virtually all vendors as

similar to Bernoulli sampling, except that it samples database data at the level of I/O

units (e.g., storage pages) instead of sampling at the record level. For example, a 5% sys-

tem sample of a table would sample 5% of the storage pages. Sampling pages instead of

records dramatically improved the performance of the sampling process, as shown in

Figure 10.4. A page is included in the sample with a probability of P /100. If a page is

included, all of the rows in that page are included. However, if the data has any signifi-

cant clustering in the table a system sample can lead to seriously erroneous results.

Performance of system (page-level) sampling is excellent because only one I/O is

required for each page that is included in the sample. Compared with no sampling, sys-

tem page-level sampling can improve performance by orders of magnitude.

Figure 10.4

System level sampling in SQL. (Image courtesy of IBM.)

However, the accuracy of aggregate estimates tends to be worse under system page-

level sampling than row-level sampling. This disparity in accuracy is most pronounced

when there are many rows per block or when the columns referenced in the query

exhibit a high degree of clustering within the pages. I/O sampling on a table with signif-

icant clustering (i.e., having a good cluster ratio and therefore storing similar data

together on the same or nearby pages on disk) can lead to dramatic inaccuracies. For

example, Haas [2003] used the following example to show the massive inaccuracies that

can result from using system sampling on clustered data. Haas defined a table with one

million rows, with 100 records per page. He performed experiments to take a 1% sam-

ple on a numeric column in the table using both row-level sampling and system sam-

pling and for each. The sum of the data in the column was then estimated by multiply-

ing the sum of the sample data by 100 (because a 1% sample was used). The estimate

based on applying system sampling to clustered data was an order of magnitude too

large. See Table 10.3 and Haas [2004] for a further discussion of these issues.

Search WWH ::

Custom Search

Home