Databases Reference
In-Depth Information
partition the data have sufficiently high cardinality of distinct values so that skew across
the data nodes is unlikely.
Another factor in avoiding skew is the fairness of the hash algorithm, which is usu-
ally not a problem, except in very rare scenarios. The database designer can generally
assume with confidence that the database's hashing function has been chosen carefully
and is fair. The major cause of skew is a poor ratio of distinct values in the partitioning
columns versus the number of partitions in the system. In general the number of dis-
tinct values in the partitioning column should be at least 20 to 100 times the number of
partitions in the database system. If the count of distinct values is too low consider add-
ing additional column(s) to the partitioning columns (partitioning key) so that the
hashing function can balance data more evenly. As before, the count of distinct values of
the partitioning key can be determined through the SQL.
SELECT COUNT(*) FROM (SELECT DISTINCT COL1, COL2 FROM
MY_FAV_TABLE) AS PARTITIONING_VALUES;
Another loose indicator in the selection of the partitioning key is the ratio of dis-
tinct values in the key to total rows. If the ratio is low there are many duplicates in the
data and the partitioning is subject to skew. As a rule of thumb, a ratio of better than
7:10 should be a design point. However, this rule is very approximate. If there is heavy
skew in the 30% of nonunique values (e.g., consider the pathological case where 70% of
the partitioning key values were unique and all of the remaining partitioning values had
a single common value, then 30% of the data would hash to the same data node!), then
even the 7:10 rule may not be adequate. Conversely if the data is evenly distributed so
that every distinct value has roughly the same number of duplicates then a very low
ratio of distinct value may be perfectly fine.
10.2
The Power of Sampling
10.2.1 The Benefits of Sampling with SQL
The act of counting can itself be time consuming over very large data sets. In many
cases the performance of the count can be dramatically improved and the accuracy little
changed if the data is sampled.
Certainly, in terms of counting the relative numbers of values, sampling has been
shown to be highly effective with little loss in accuracy over many varied data sets. SQL
offers sampling syntax, supported by a number of vendors.
Figures 10.2 and 10.3 illustrate the effectiveness of sampling in determining rela-
tive amounts of data with very good accuracy and dramatic performance benefits.
Search WWH ::




Custom Search