Database Reference
In-Depth Information
perform an aggregation that is not distribution compatible (we do not see
the distribution key in the group by); therefore, it has to shuffle the data
again before moving to its ReturnOperation .
However, if we look at the read query plan steps immediately preceding
their respective Hadoop DMS operations, we can see that both read queries
perform the same steps as the import. The only difference is that import
must also create the table FactInternetSales_Import and check that
the user has the permissions to create a table, which is what it is doing in
steps 0 and 1 of its plan in Figure 10.9 . Otherwise, these plans are the same.
Owing to the very loose coupling offered by Polybase, it may make sense to
move the data into PDW via a CTAS import before executing any further
queries. This affords you greater consistency because you will now be
querying “your” copy of the data. Now you can guarantee that you are
the only person able to edit the data if you so wish. Although this is a
disconnected data set from the source, this does have a second added
benefit. By moving the data into PDW first, you ensure that future queries
on this data will execute more quickly because: a) you will have already
imported the data and so won't be paying that write penalty on every
execution and b) you can optimize subsequent queries by creating statistics
on important columns used by subsequent queries.
Exporting Data with CETAS
PDW uses the same basic mechanism for exporting data as it does for
importing. The transfer of data is native and parallel in nature, so we can
push the data out very simply, efficiently, and effectively. Ideally, the
Hadoop cluster would be attached to the PDW's Infiniband network for
maximum performance (although, this would not be a cheap option). A
more realistic option is to use 10-Gigabit Ethernet (10GbE).
To export the data, we use a variation on the CREATE TABLE AS SELECT
methodseenwhenimportingdata.Thistimeweexportdatabyfirstcreating
an external table and then pushing the data through it. This provides us
with all the metadata we need to identify the location of the cluster and the
format of the delimited file in HDFS. Consequently, the DDL for this action
is CREATE EXTERNAL TABLE AS SELECT (CETAS). See the following
code for a simple example:
Search WWH ::




Custom Search