Database Reference
In-Depth Information
In a distributed architecture of this sort, data skew or computational skew can
occur. It is important to select a distribution key with unique values and high
cardinality, and to ensure that the key will not cause computational skew.
With respect to cardinality, Boolean keys (for example, True/False or Y/N) are
typically unsuitable as a distribution key, because their rows will be distributed
to only two segment instances. In an MPP environment, the overall response time for
a query depends on the completion time of all segment instances, so the slowest
segment determines how long the query takes.
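The effect of key cardinality on skew can be illustrated with a small simulation. This is only a sketch under stated assumptions: CRC32 stands in for the database's real hashing algorithm (which the text does not specify), and the segment count, row count, and helper names are hypothetical.

```python
import zlib
from collections import Counter

def segment_for(key, num_segments):
    """Map a key to a segment using a stable hash.

    Assumption: CRC32 is a stand-in, not the hash the database actually uses.
    """
    return zlib.crc32(repr(key).encode("utf-8")) % num_segments

def rows_per_segment(keys, num_segments):
    """Return the number of rows each segment would receive."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(s, 0) for s in range(num_segments)]

NUM_SEGMENTS = 8      # hypothetical cluster size
NUM_ROWS = 80_000     # hypothetical table size

# A Boolean key has only two distinct values; a unique key has one per row.
boolean_keys = [i % 2 == 0 for i in range(NUM_ROWS)]
unique_keys = list(range(NUM_ROWS))

bool_load = rows_per_segment(boolean_keys, NUM_SEGMENTS)
unique_load = rows_per_segment(unique_keys, NUM_SEGMENTS)

# The busiest segment finishes last, so its load is a rough proxy
# for the overall query response time.
print("boolean key load:", bool_load, "busiest:", max(bool_load))
print("unique key load: ", unique_load, "busiest:", max(unique_load))
```

With the Boolean key, at most two segments hold any rows while the rest sit idle; with the unique key, the rows spread across all segments and the busiest segment carries far less work.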
There are two types of distribution policies that help divide rows among the available
segments:
Hash distribution: In this distribution technique, one or more table columns
are used as the distribution key. These columns are used by the hashing
algorithm to divide data among all of the segments. The key value is hashed, or
a random number is created. There are performance advantages to choosing
a hash policy whenever possible. The largest performance advantages
come into play when joining two tables that use the same distribution key. In
this case the system does not have to shuffle data between nodes to do a
join.
Round robin distribution: When no distribution key is defined, this
algorithm is used. In this case, rows are sent to the segments as they come in.
This mechanism is usually used for smaller tables.
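The two policies, and the join advantage of a shared distribution key, can be sketched as follows. This is an illustrative model rather than the database's implementation: CRC32 is an assumed stand-in for the real hash, and the `customers` and `orders` tables, the column names, and all function names are hypothetical.

```python
import zlib

def hash_distribute(rows, key, num_segments):
    """Place each row on the segment chosen by a stable hash of its key column."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        # Assumption: CRC32 stands in for the database's hashing algorithm.
        seg = zlib.crc32(str(row[key]).encode("utf-8")) % num_segments
        segments[seg].append(row)
    return segments

def round_robin_distribute(rows, num_segments):
    """Place rows on segments in arrival order, ignoring their contents."""
    segments = [[] for _ in range(num_segments)]
    for position, row in enumerate(rows):
        segments[position % num_segments].append(row)
    return segments

def rows_needing_shuffle(left_segments, right_segments, key):
    """Count right-side rows whose join partner is not on the same segment."""
    moved = 0
    for seg_id, right_rows in enumerate(right_segments):
        local_keys = {row[key] for row in left_segments[seg_id]}
        moved += sum(1 for row in right_rows if row[key] not in local_keys)
    return moved

NUM_SEGMENTS = 4  # hypothetical cluster size
customers = [{"customer_id": i, "name": f"c{i}"} for i in range(100)]
orders = [{"order_id": n, "customer_id": (n * 7) % 100} for n in range(1000)]

# Both tables hashed on the same key: matching rows share a segment.
hash_customers = hash_distribute(customers, "customer_id", NUM_SEGMENTS)
hash_orders = hash_distribute(orders, "customer_id", NUM_SEGMENTS)

# Both tables spread round robin: placement ignores the join key.
rr_customers = round_robin_distribute(customers, NUM_SEGMENTS)
rr_orders = round_robin_distribute(orders, NUM_SEGMENTS)

print("hash, rows to shuffle:       ",
      rows_needing_shuffle(hash_customers, hash_orders, "customer_id"))
print("round robin, rows to shuffle:",
      rows_needing_shuffle(rr_customers, rr_orders, "customer_id"))
```

Round robin yields evenly sized segments regardless of the data, but because placement ignores the join column, many order rows sit on a different segment from their customer and would have to move before the join. With a shared hash key, every join can be resolved locally on each segment.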
Hadoop (HD)
To handle analytics for unstructured data, Greenplum UAP provides a
commercial version of Apache Hadoop. The HD distribution is integrated with
Greenplum Database and supports parallel analytics.
Hadoop is a framework for the distributed processing of large unstructured
data sets across clusters of commodity servers. It can both store large volumes
of data and process the data it stores.
Hadoop originated as an open source Apache project and is implemented in Java.
The following figure depicts two core components of Hadoop: