Database Reference
In-Depth Information
cannot be included into distributed key. Every table has a default distribution strategy
and we should not be using that as it may introduce skew. In Hash distributions, a lot
of care needs to be taken to ensure there is no data skew seen. The following are
few important considerations while defining Hash distribution keys:
• Use keys with unique values and high cardinality to distribute the data evenly
across all segment instances.
• Avoid Boolean keys like T / F , Y / N , or 1 / 0 .
• With data skew problem solved, ensure there wouldn't be computational
skew. Let us take an example where the distribution key had a DATE column
and the data is always evenly distributed; and in case where there are quer-
ies to get and process data for a particular month, it could happen that a par-
ticular segment will end up serving all the requests, thus resulting in compu-
tational skew.
Distribution key definition should depend on the data schema structure and most
common querying patterns. Commonly used joined tables should use same data
types for distribution keys. One important note that database designers should con-
sider is that, every time a query is fired that might involve joining data between seg-
ment server, Greenplum would internally do a co-location exercise or redistribution
motion to respond to the query request, that would be destroyed post communicat-
ing the results to the client.
The co-location or redistribution motion is all about the process, where to perform
a local join matching rows must be located together on the same segment instance
and in absence of which a dynamic redistribution of the needed rows from one of
the tables to another segment instance will be performed. This might prove to be ex-
pensive for the table with large volumes of data and might just work fine for smaller
tables.
In some cases a broadcast motion will be performed rather than a redistribute mo-
tion. In a broadcast motion, every segment instance performs a broadcast or sends
its own individual rows to all other segment instances. This will result in every seg-
ment instance having its own complete and local copy of the entire table. A broad-
cast motion may not be as optimal as a redistribute motion. Therefore, the optimizer
typically selects a broadcast motion only for very small tables. A broadcast motion is
not acceptable for large tables.
Following is the syntax for defining distribution strategy for a table in Greenplum:
Search WWH ::




Custom Search