Figure 10.2
Effect of sampling on an aggregation query. (Image courtesy of IBM.)
Figure 10.2 [Haas 2003] shows how sampling was used to determine a distribution
of data by year in 2 seconds instead of 4 minutes, producing the result roughly 100
times faster with very little loss in accuracy.
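The idea behind Figure 10.2 can be sketched in a few lines: estimate the per-year row counts from a small random sample and scale them up by the inverse of the sampling rate. The synthetic year distribution, the 0.1% rate, and the variable names below are assumptions for illustration, not the actual IBM experiment.

```python
import random
from collections import Counter

random.seed(1)
# Synthetic "large" table: one year value per row.
years = [random.choice(range(1990, 2000)) for _ in range(1_000_000)]

# Exact distribution (what a full scan would compute).
full_dist = Counter(years)

# 0.1% random sample: keep each row independently with probability 0.001.
rate = 0.001
sample = [y for y in years if random.random() < rate]

# Scale sampled counts up by 1/rate to estimate per-year totals.
est_dist = {y: c / rate for y, c in Counter(sample).items()}
```

Each estimated count is within a few percent of the exact count here, while the sample touches only about a thousandth of the rows.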
Similarly, Figure 10.3 [Haas 2003] illustrates linear regression performed on a
sample of a million-row data set. Notice how a 0.01% sample produced almost exactly
the same results as a full table scan, while potentially running ten thousand times
faster.
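The regression result in Figure 10.3 can be reproduced in spirit with a small sketch: fit a least-squares line on a tiny random sample of a synthetic million-row data set and compare it with the full-data fit. The data, the 0.01% rate, and the helper function are assumptions for illustration.

```python
import random

def fit_line(points):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

random.seed(42)
# Synthetic million-row table following y = 3x + 7 plus noise.
full = [(x, 3.0 * x + 7.0 + random.gauss(0, 5)) for x in range(1_000_000)]

full_fit = fit_line(full)

# 0.01% sample: keep each row independently with probability 0.0001.
sample = [p for p in full if random.random() < 0.0001]
sample_fit = fit_line(sample)
```

The slope fitted on roughly a hundred sampled rows agrees with the full-scan slope to several decimal places, mirroring the figure's point that the estimates are nearly indistinguishable.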
A variety of alternatives to “pure” sampling are based on various synopses of the
data that are computed during a complete scan. See, for example, Acharya et al. [1999],
Chaudhuri et al. [1999], and Ganguly [1996]. Although such methods can yield very
accurate estimates, the synopses are expensive to compute, impose continual storage
space requirements, must be incrementally maintained over time, and allow relatively
little user control over estimation accuracy. These synopses are therefore not currently
supported by commercial database systems.
10.2.2 Sampling for Database Design
Sampling is supported in SQL via the TABLESAMPLE clause, which specifies either
Bernoulli sampling or system sampling, together with a sampling rate. Both types of
sampling are described in the next section. An example of the SQL syntax follows,
illustrating a 5% sample of the EMP table:
SELECT SUM(BONUS) FROM EMP TABLESAMPLE BERNOULLI (5)
WHERE DEPTNAME = 'H917'
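A minimal sketch of what the query above computes: under Bernoulli sampling, each qualifying row is kept independently with probability 0.05, and the sampled sum is scaled up by 1/0.05 to estimate SUM(BONUS). The EMP rows and bonus values below are invented for illustration.

```python
import random

random.seed(7)
# Invented EMP table: every row in department H917 with a random bonus.
emp = [{"deptname": "H917", "bonus": random.uniform(500, 1500)}
       for _ in range(100_000)]

# What the full-table query would return.
true_sum = sum(r["bonus"] for r in emp if r["deptname"] == "H917")

# 5% Bernoulli sample: independent coin flip per qualifying row.
rate = 0.05
sampled = [r["bonus"] for r in emp
           if r["deptname"] == "H917" and random.random() < rate]

# Scale the sampled sum up by 1/rate to estimate the true sum.
estimate = sum(sampled) / rate
```

The estimate typically lands within a few percent of the true sum while reading only a twentieth of the rows, which is why aggregation queries benefit so directly from sampling.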
Sampling works extremely well when the objective is to estimate a distribution.
When sampling is used to determine a count of distinct values, however, serious
problems arise, because the sample may include all of the distinct values or only a
very few. Unfor-