Greenplum Unified Analytics Platform (UAP) - Getting Started with Greenplum for Big Data Analytics


In a distributed architecture of this sort, there can be data skew or computational skew. It is important to select a distribution key with unique, or at least high-cardinality, values; we should also ensure that the key will not result in computational skew. Boolean-style keys, for example True/False or Y/N, are not suitable as a distribution key because their rows are sent to at most two segment instances. In an MPP environment, the overall response time for a query depends on the completion time of all segment instances, so the query is only as fast as its most heavily loaded segment.
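
The effect of key cardinality on data skew can be illustrated with a small simulation. This is a minimal sketch, not Greenplum's implementation: the segment count, the helper names (`segment_for`, `rows_per_segment`, `skew_ratio`), and the use of SHA-256 as a stand-in for the database's internal hash function are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

NUM_SEGMENTS = 8  # hypothetical cluster size, chosen for illustration


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a key to a segment with a stable hash.

    SHA-256 is only a stand-in; the real hash used by the database
    is not described in the text.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def rows_per_segment(keys, num_segments=NUM_SEGMENTS):
    """Count how many rows each segment would receive for the given keys."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(seg, 0) for seg in range(num_segments)]


def skew_ratio(per_segment):
    """Largest segment load divided by the ideal even load (1.0 = no skew)."""
    ideal = sum(per_segment) / len(per_segment)
    return max(per_segment) / ideal


# Low-cardinality key: a Y/N flag on 10,000 rows.
flag_load = rows_per_segment(["Y" if i % 2 else "N" for i in range(10_000)])
# High-cardinality key: a unique integer id on 10,000 rows.
id_load = rows_per_segment(range(10_000))

print("Y/N flag :", flag_load, "skew", round(skew_ratio(flag_load), 2))
print("unique id:", id_load, "skew", round(skew_ratio(id_load), 2))
```

With the Y/N key, no more than two segments hold any rows and the rest sit idle, while the unique id spreads rows close to evenly. Since a query finishes only when its busiest segment finishes, the skew ratio is a rough indicator of how much slower the skewed layout would be.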

There are two types of distribution policies that divide rows among the available segments:

• Hash distribution: In this technique, one or more table columns are used as the distribution key. The hashing algorithm uses these columns to divide data among all of the segments: the key value is hashed, and the result determines which segment receives the row. There are performance advantages to choosing a hash policy whenever possible. The largest advantages come into play when joining two tables that use the same distribution key; in this case the system does not have to shuffle data between nodes to perform the join.

• Round robin distribution: This algorithm is used when no distribution key is defined. In this case, rows are sent to the segments in turn as they come in. This mechanism is usually used for smaller tables.
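
The two policies, and the join advantage of a shared distribution key, can be sketched as follows. This is an illustrative model under stated assumptions rather than Greenplum code: the segment count, the sample `customers` and `orders` tables, every function name, and the SHA-256 stand-in for the database's hash function are hypothetical.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical number of segments


def hash_segment(key, num_segments=NUM_SEGMENTS):
    """Hash policy: the same key value always maps to the same segment."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute_by_hash(rows, key_column, num_segments=NUM_SEGMENTS):
    """Place each row on the segment chosen by hashing its key column."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[hash_segment(row[key_column], num_segments)].append(row)
    return segments


def distribute_round_robin(rows, num_segments=NUM_SEGMENTS):
    """Round robin policy: rows go to the segments in turn, in arrival order."""
    segments = [[] for _ in range(num_segments)]
    for position, row in enumerate(rows):
        segments[position % num_segments].append(row)
    return segments


def local_join(left_segments, right_segments, key_column):
    """Join rows only with rows stored on the same segment (no data movement)."""
    joined = []
    for left_rows, right_rows in zip(left_segments, right_segments):
        for left in left_rows:
            for right in right_rows:
                if left[key_column] == right[key_column]:
                    joined.append({**left, **right})
    return joined


customers = [{"customer_id": i, "name": f"customer-{i}"} for i in range(20)]
orders = [{"customer_id": (i * 7) % 20, "order_id": 1000 + i} for i in range(60)]

# Both tables hashed on customer_id: matching rows share a segment,
# so a purely local join finds every match.
colocated = local_join(
    distribute_by_hash(customers, "customer_id"),
    distribute_by_hash(orders, "customer_id"),
    "customer_id",
)

# Both tables spread round robin: matching rows are scattered, so a
# local-only join misses matches unless rows are first moved between segments.
scattered = local_join(
    distribute_round_robin(customers),
    distribute_round_robin(orders),
    "customer_id",
)

print(len(colocated), "of", len(orders), "orders joined locally (hash)")
print(len(scattered), "of", len(orders), "orders joined locally (round robin)")
```

Round robin always yields an even row count per segment, which is why it suits small tables, but it gives the system no way to know where a given key lives. Hashing both tables on the join column guarantees that matching rows are already together, which is the situation in which no shuffle is needed.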

Hadoop (HD)

To handle analytics on unstructured data, Greenplum UAP provides a commercial version of Apache Hadoop. The HD distribution is integrated with Greenplum Database and supports parallel analytics.

Hadoop is a framework that allows for distributed processing of large unstructured data sets across clusters of commodity servers. It can both store large amounts of data and process the data it stores.

Hadoop originated as an open source Apache project and is implemented in Java.

The following figure depicts two core components of Hadoop:
