Implementing Analytics with Greenplum UAP - Getting Started with Greenplum for Big Data Analytics


Greenplum table distribution and partitioning

In the following section, we will define table distribution in the Greenplum context and detail related aspects of distribution, such as data skew.

Distribution

Greenplum is a massively parallel processing (MPP) data store, and data is distributed across segments according to the distribution strategy defined for each table.

Every table in Greenplum has a data distribution method; the DISTRIBUTED BY clause defines the distribution strategy. We need to ensure that the chosen distribution key does not introduce data skew, that is, that no segment host ends up holding a disproportionate share of the rows.
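One way to check for skew is to count rows per segment. The query below is a minimal sketch: the table name sales is hypothetical, while gp_segment_id is the Greenplum system column that reports which segment stores each row.

```sql
-- Hypothetical table name: sales.
-- gp_segment_id identifies the segment that holds each row.
SELECT gp_segment_id,
       count(*) AS row_count
FROM sales
GROUP BY gp_segment_id
ORDER BY row_count DESC;
```

Roughly equal counts across segments suggest an even distribution; one segment with far more rows than the others points to a skewed distribution key.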

There are two methods of distributing table data across segment hosts:

• Column-based/hash distribution: Rows are assigned to segments by hashing the value of a column or a combination of columns (the distribution key):

DISTRIBUTED BY (column name(s))

• Random distribution: In this mechanism, data is distributed across the segment servers in a round-robin fashion, so there is no data skew on the segments. Any table that uses random distribution requires either a redistribution or a broadcast operation to perform a table join, and these operations have performance implications for very large tables. Random distribution should be used for small tables and when hash distribution is not feasible due to significant data skew:

DISTRIBUTED RANDOMLY
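The two clauses above are attached to the end of a CREATE TABLE statement. The following is a minimal sketch of each; all table and column names are hypothetical and chosen only for illustration.

```sql
-- Hypothetical tables and columns, for illustration only.

-- Hash distribution: rows with the same customer_id land on the same segment.
CREATE TABLE sales (
    sale_id     bigint,
    customer_id integer,
    sale_date   date,
    amount      numeric(12,2)
)
DISTRIBUTED BY (customer_id);

-- Random distribution: rows are spread without regard to column values.
CREATE TABLE country_codes (
    code text,
    name text
)
DISTRIBUTED RANDOMLY;
```

A column with many distinct, evenly spread values that is also frequently used in joins is usually the better hash key, since tables joined on their distribution key can be joined on each segment without moving rows.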

The distribution key can be modified at any time. If the table has a unique key, that key must be taken into account when choosing the distribution key. User-defined data types
