Greenplum table distribution and partitioning
In the following section, we will define table distribution in the Greenplum context and
detail related aspects of distribution, such as data skew.
Distribution
Greenplum is a massively parallel processing (MPP) data store, and data is distributed
across segments according to the distribution strategy defined for each table.
Every table in Greenplum has a data distribution method; the DISTRIBUTED BY
clause defines the distribution strategy. We need to ensure that the chosen
distribution key does not introduce data skew, that is, an uneven share of rows,
on any of the segment hosts.
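One way to look for data skew is to count rows per segment. The query below is a minimal sketch that assumes a hypothetical table named sales already exists; it relies on the gp_segment_id system column that Greenplum exposes on every table.

```sql
-- Count the rows stored on each segment for a hypothetical table "sales".
-- Roughly equal counts suggest an even distribution; one segment holding
-- far more rows than the others points to skew in the distribution key.
SELECT gp_segment_id,
       count(*) AS row_count
FROM sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

How much variation between segments is acceptable depends on the workload, so treat the counts as a starting point rather than a pass or fail check.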
There are two methods of distributing table data across segment hosts:
Column-oriented/hash distribution: This mechanism hashes a column or a
combination of columns to decide which segment each row is stored on:
DISTRIBUTED BY (column name(s))
Random distribution: In this mechanism, data is distributed across the segment
servers in a round-robin fashion, so there is no data skew on the segments.
Any table that uses random distribution requires either a redistribution or a
broadcast operation to perform a table join, and redistributing or broadcasting
very large tables has performance implications. Random distribution should be
used for small tables and when hash distribution is not feasible due to
significant data skew:
DISTRIBUTED RANDOMLY
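The two clauses above can be sketched in full CREATE TABLE statements as follows. The table names (sales, audit_log) and their columns are hypothetical and chosen only for illustration.

```sql
-- Hash distribution: rows with the same customer_id are stored on the
-- same segment. Table and column names are illustrative assumptions.
CREATE TABLE sales (
    sale_id     bigint,
    customer_id integer,
    sale_date   date,
    amount      numeric(12,2)
)
DISTRIBUTED BY (customer_id);

-- Random distribution: rows are spread evenly with no distribution key,
-- which suits a small table that has no good candidate column.
CREATE TABLE audit_log (
    logged_at timestamp,
    message   text
)
DISTRIBUTED RANDOMLY;
```

Choosing customer_id here assumes its values are spread fairly evenly; a column with a few dominant values would concentrate rows on a few segments.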
The distribution key can be modified at any point in time. If the table has a unique
key, the distribution key must be made up of columns from that unique key.
User-defined data types