Counting and Data Sampling in Physical Design Exploration - Physical Database Design

Databases Reference

In-Depth Information

partition the data have sufficiently high cardinality of distinct values so that skew across

the data nodes is unlikely.

Another factor in avoiding skew is the fairness of the hash algorithm, which is usu-

ally not a problem, except in very rare scenarios. The database designer can generally

assume with confidence that the database's hashing function has been chosen carefully

and is fair. The major cause of skew is a poor ratio of distinct values in the partitioning

columns versus the number of partitions in the system. In general the number of dis-

tinct values in the partitioning column should be at least 20 to 100 times the number of

partitions in the database system. If the count of distinct values is too low consider add-

ing additional column(s) to the partitioning columns (partitioning key) so that the

hashing function can balance data more evenly. As before, the count of distinct values of

the partitioning key can be determined through the SQL.

SELECT COUNT(*) FROM (SELECT DISTINCT COL1, COL2 FROM

MY_FAV_TABLE) AS PARTITIONING_VALUES;

Another loose indicator in the selection of the partitioning key is the ratio of dis-

tinct values in the key to total rows. If the ratio is low there are many duplicates in the

data and the partitioning is subject to skew. As a rule of thumb, a ratio of better than

7:10 should be a design point. However, this rule is very approximate. If there is heavy

skew in the 30% of nonunique values (e.g., consider the pathological case where 70% of

the partitioning key values were unique and all of the remaining partitioning values had

a single common value, then 30% of the data would hash to the same data node!), then

even the 7:10 rule may not be adequate. Conversely if the data is evenly distributed so

that every distinct value has roughly the same number of duplicates then a very low

ratio of distinct value may be perfectly fine.

10.2

The Power of Sampling

10.2.1 The Benefits of Sampling with SQL

The act of counting can itself be time consuming over very large data sets. In many

cases the performance of the count can be dramatically improved and the accuracy little

changed if the data is sampled.

Certainly, in terms of counting the relative numbers of values, sampling has been

shown to be highly effective with little loss in accuracy over many varied data sets. SQL

offers sampling syntax, supported by a number of vendors.

Figures 10.2 and 10.3 illustrate the effectiveness of sampling in determining rela-

tive amounts of data with very good accuracy and dramatic performance benefits.

Search WWH ::

Custom Search

Home