Data Warehouses and Hadoop Integration - Microsoft Big Data Solutions

Database Reference

In-Depth Information

perform an aggregation that is not distribution compatible (we do not see

the distribution key in the group by); therefore, it has to shuffle the data

again before moving to its ReturnOperation .

However, if we look at the read query plan steps immediately preceding

their respective Hadoop DMS operations, we can see that both read queries

perform the same steps as the import. The only difference is that import

must also create the table FactInternetSales_Import and check that

the user has the permissions to create a table, which is what it is doing in

steps 0 and 1 of its plan in Figure 10.9 . Otherwise, these plans are the same.

Owing to the very loose coupling offered by Polybase, it may make sense to

move the data into PDW via a CTAS import before executing any further

queries. This affords you greater consistency because you will now be

querying “your” copy of the data. Now you can guarantee that you are

the only person able to edit the data if you so wish. Although this is a

disconnected data set from the source, this does have a second added

benefit. By moving the data into PDW first, you ensure that future queries

on this data will execute more quickly because: a) you will have already

imported the data and so won't be paying that write penalty on every

execution and b) you can optimize subsequent queries by creating statistics

on important columns used by subsequent queries.

Exporting Data with CETAS

PDW uses the same basic mechanism for exporting data as it does for

importing. The transfer of data is native and parallel in nature, so we can

push the data out very simply, efficiently, and effectively. Ideally, the

Hadoop cluster would be attached to the PDW's Infiniband network for

maximum performance (although, this would not be a cheap option). A

more realistic option is to use 10-Gigabit Ethernet (10GbE).

To export the data, we use a variation on the CREATE TABLE AS SELECT

methodseenwhenimportingdata.Thistimeweexportdatabyfirstcreating

an external table and then pushing the data through it. This provides us

with all the metadata we need to identify the location of the cluster and the

format of the delimited file in HDFS. Consequently, the DDL for this action

is CREATE EXTERNAL TABLE AS SELECT (CETAS). See the following

code for a simple example:

Search WWH ::

Custom Search

Home