Databases Reference
In-Depth Information
Table 10.3
Failure of System (Page-level) Sampling for Clustered Data
Stratification is the process of grouping the data to be sampled into subgroups
before sampling. Stratified sampling can reduce the chances of poor results caused by
possible bad luck of the “random” choices being poorly distributed. Because random
sampling is truly random it's possible that a random selection could (in theory) dis-
proportionately choose records from a given range of the table—however improbable
that is. This is similar to rolling a die five times and landing on the same number each
time. Unlikely, but possible. With random sampling (Bernoulli or simple), the final
choice may not be well distributed even though the selection probabilities are fair. To
solve this problem, stratified sampling divides the table into ranges of pages and per-
forms random sampling within each range as shown in Figure 10.5. This ensures that
records are chosen from each range. Some database vendors support stratified sam-
pling, although it is not normally part of the SQL standard.
10.2.4 Repeatability with Sampling
The SQL standard supports the REPEATABLE keyword for use with the TABLESAM-
PLE clause, which allows the caller to specify a seed for the sampling process. This
makes it possible to perform sampling repeatedly, so that the same sample will be cho-
sen over the same data set every time, provided that the seed, data, sampling type, and
sampling rate are all held fixed. The REPEATABLE clause is very useful for debugging
and testing when using sampling. Sample syntax follows:
SELECT SUM(BONUS) FROM EMP TABLESAMPLE BERNOULLI (5)
REPEATABLE (Y) WHERE DEPTNAME = 'H917'
10.3
An Obvious Limitation
We've presented a fairly positive perspective on the use of SQL to count and sample
data in order to make effective database design choices. However, there is a chicken and
egg problem. All of the techniques above do require a database to exist that is populated
with data in order for SQL to be executed. In the early stages of database development a
Search WWH ::




Custom Search