Figure 10.2
Effect of sampling on an aggregation query. (Image courtesy of IBM.)
Figure 10.2 [Haas 2003] shows how sampling was used to determine a distribution
of data by year in 2 seconds instead of 4 minutes, producing the result roughly 100
times faster with very little loss in accuracy.
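The idea behind Figure 10.2 can be sketched in a few lines: estimate the per-year row counts from a small random sample and scale them up by the inverse of the sampling rate. The synthetic year distribution, the 0.1% rate, and the variable names below are assumptions for illustration, not the actual IBM experiment.

```python
import random
from collections import Counter

random.seed(1)
# Synthetic "large" table: one year value per row.
years = [random.choice(range(1990, 2000)) for _ in range(1_000_000)]

# Exact distribution (what a full scan would compute).
full_dist = Counter(years)

# 0.1% random sample: keep each row independently with probability 0.001.
rate = 0.001
sample = [y for y in years if random.random() < rate]

# Scale sampled counts up by 1/rate to estimate per-year totals.
est_dist = {y: c / rate for y, c in Counter(sample).items()}
```

Each estimated count is within a few percent of the exact count here, while the sample touches only about a thousandth of the rows.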
Similarly, Figure 10.3 [Haas 2003] illustrates linear regression performed on a
sample of a million-row data set. Notice how a 0.01% sample produced almost exactly
the same results as a full table scan, while potentially running ten thousand times
faster.
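The regression result in Figure 10.3 can be reproduced in spirit with a small sketch: fit a least-squares line on a tiny random sample of a synthetic million-row data set and compare it with the full-data fit. The data, the 0.01% rate, and the helper function are assumptions for illustration.

```python
import random

def fit_line(points):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

random.seed(42)
# Synthetic million-row table following y = 3x + 7 plus noise.
full = [(x, 3.0 * x + 7.0 + random.gauss(0, 5)) for x in range(1_000_000)]

full_fit = fit_line(full)

# 0.01% sample: keep each row independently with probability 0.0001.
sample = [p for p in full if random.random() < 0.0001]
sample_fit = fit_line(sample)
```

The slope fitted on roughly a hundred sampled rows agrees with the full-scan slope to several decimal places, mirroring the figure's point that the estimates are nearly indistinguishable.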
A variety of alternatives to “pure” sampling are based on various synopses of the
data that are computed during a complete scan. See, for example, Acharya et al. [1999],
Chaudhuri et al. [1999], and Ganguly [1996]. Although such methods can yield very
accurate estimates, the synopses are expensive to compute, impose continual storage
space requirements, must be incrementally maintained over time, and allow relatively
little user control over estimation accuracy. These synopses are therefore not currently
supported by commercial database systems.
10.2.2 Sampling for Database Design
Sampling is supported in SQL via the TABLESAMPLE clause, which specifies either
Bernoulli sampling or system sampling, together with a sampling rate. Both types of
sampling are described in the next section. An example of the SQL syntax follows,
illustrating a 5% sample of the EMP table:
SELECT SUM(BONUS) FROM EMP TABLESAMPLE BERNOULLI (5)
WHERE DEPTNAME = 'H917'
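A minimal sketch of what the query above computes: under Bernoulli sampling, each qualifying row is kept independently with probability 0.05, and the sampled sum is scaled up by 1/0.05 to estimate SUM(BONUS). The EMP rows and bonus values below are invented for illustration.

```python
import random

random.seed(7)
# Invented EMP table: every row in department H917 with a random bonus.
emp = [{"deptname": "H917", "bonus": random.uniform(500, 1500)}
       for _ in range(100_000)]

# What the full-table query would return.
true_sum = sum(r["bonus"] for r in emp if r["deptname"] == "H917")

# 5% Bernoulli sample: independent coin flip per qualifying row.
rate = 0.05
sampled = [r["bonus"] for r in emp
           if r["deptname"] == "H917" and random.random() < rate]

# Scale the sampled sum up by 1/rate to estimate the true sum.
estimate = sum(sampled) / rate
```

The estimate typically lands within a few percent of the true sum while reading only a twentieth of the rows, which is why aggregation queries benefit so directly from sampling.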
Sampling works extremely well when the objective is to estimate a distribution.
When sampling is used to determine a count of distinct values, however, serious
problems arise, because the sample may include all of the distinct values or only a
very few. Unfor-