Big Data Computing Applications - Guide to Cloud Computing for Business and Technology Managers

partitioning of data by row in relational databases is not new and is referred to as horizontal partitioning in parallel database technology. The distinction between sharding and horizontal partitioning is that horizontal partitioning is done transparently to the application by the database, whereas sharding is explicit partitioning done by the application. However, the two techniques have started converging, since traditional database vendors have started offering support for more sophisticated partitioning strategies. Since sharding is similar to horizontal partitioning, we first discuss different horizontal partitioning techniques. A good sharding technique depends on both the organization of the data and the type of queries expected.

The different techniques of sharding are as follows:

1. Round-robin partitioning: The round-robin method distributes the rows over the different databases in turn. In the example, we could partition the transaction table into multiple databases so that the first transaction is stored in the first database, the second in the second database, and so on. The advantage of round-robin partitioning is its simplicity. Its disadvantage is that associations between records are lost: related records (say, those sharing a Customer_Id) may be stored in different databases, so a query on them misses some records unless all databases are queried. Hash partitioning and range partitioning do not suffer from the disadvantage of losing record associations.

2. Hash partitioning: In this method, the value of a selected attribute is hashed to find the database into which the tuple should be stored. If queries are frequently made on an attribute (say Customer_Id), then associations can be preserved by using this attribute as the attribute that is hashed, so that records with the same value of this attribute can be found in the same database.

3. Range partitioning: The range partitioning technique stores records whose partitioning attribute falls in the same range of values in the same database. For example, the range of Customer_Id values could be divided among the different databases. Again, if the attributes chosen for grouping are those on which queries are frequently made, record association is preserved and it is not necessary to merge results from different databases. Range partitioning can be susceptible to load imbalance unless the ranges are chosen carefully: a poor choice can cause an imbalance in the amount of data stored in the partitions (data skew) or in the execution of queries across partitions (execution skew). These problems are less likely in round-robin and hash partitioning, since they tend to distribute the data uniformly over the partitions.
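The three techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's implementation: the function names, the choice of four databases, the range boundaries, and the sample transactions are all hypothetical. Each function maps a row to a database number, and two helpers show which databases hold a given customer's records and how many rows each database receives.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_DATABASES = 4  # assumed number of shards for this sketch

def round_robin_shard(row_number: int) -> int:
    """Row i goes to database i mod N, regardless of its contents."""
    return row_number % NUM_DATABASES

def hash_shard(customer_id: int) -> int:
    """Hash the partitioning attribute; equal values always map to the same database."""
    digest = hashlib.sha256(str(customer_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DATABASES

# Assumed exclusive upper bounds of the first three ranges; the last range is open-ended.
RANGE_BOUNDS = [1000, 2000, 3000]

def range_shard(customer_id: int) -> int:
    """Customer_Id < 1000 -> 0, 1000..1999 -> 1, 2000..2999 -> 2, >= 3000 -> 3."""
    return bisect_right(RANGE_BOUNDS, customer_id)

# Each placement takes (transaction_number, customer_id) and returns a database number.
PLACEMENTS = {
    "round_robin": lambda txn_id, cid: round_robin_shard(txn_id),
    "hash": lambda txn_id, cid: hash_shard(cid),
    "range": lambda txn_id, cid: range_shard(cid),
}

# Hypothetical transaction table: (transaction_number, Customer_Id).
# Most customers have small ids, which makes the assumed ranges a poor fit.
customer_ids = [7, 7, 7, 7, 42, 42, 350, 350, 999, 1500, 2500, 3100]
transactions = list(enumerate(customer_ids))

def shards_holding_customer(rows, shard_of_row, customer_id):
    """Databases that must be queried to see every row of one customer."""
    return {shard_of_row(txn_id, cid) for txn_id, cid in rows if cid == customer_id}

def rows_per_shard(rows, shard_of_row):
    """Number of rows stored in each database (uneven counts indicate data skew)."""
    counts = Counter(shard_of_row(txn_id, cid) for txn_id, cid in rows)
    return [counts.get(shard, 0) for shard in range(NUM_DATABASES)]

if __name__ == "__main__":
    for name, placement in PLACEMENTS.items():
        print(name,
              "customer 7 is in databases", sorted(shards_holding_customer(transactions, placement, 7)),
              "rows per database", rows_per_shard(transactions, placement))
```

With this sample, round-robin places exactly three rows in each database but scatters the four transactions of customer 7 over all four databases. Hash and range partitioning each keep that customer in a single database, while the assumed range boundaries put nine of the twelve rows in the first database, which is a small instance of data skew.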

Thus, hash partitioning is particularly well suited to large-scale systems. Round-robin simplifies a uniform distribution of records but does not facilitate the restriction of operations to single partitions. While range partitioning
