in HDFS. Remember that the world of Hadoop is Java-based, whereas PDW is, somewhat unsurprisingly, C#-based.
The HDFS Bridge therefore uses Java to provide the native integration with
Hadoop. This layer is responsible for communicating with the NameNode
and for identifying the ranges of bytes to read from, or write to, the HDFS
files residing on the data nodes. The next layer up in the HDFS Bridge stack
uses the Java Native Interface (JNI) to expose a managed C# interface to the
rest of the DMS and to the PDW Engine Service.
In Figure 10.7, you can see that the HDFS Bridge uses the Java RecordReader
or RecordWriter interface to access the data in Hadoop. The RecordReader/
RecordWriter is a pluggable element, which is what allows PDW to support
different HDFS file types.
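To make the "pluggable element" idea concrete, the sketch below defines a simplified reader contract and one line-oriented implementation. The interface and class names (`SimpleRecordReader`, `LineRecordReader`) are invented stand-ins that mirror the shape of Hadoop's RecordReader; this is not the real Hadoop API, only an illustration of why swapping the reader swaps the supported file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Simplified stand-in for Hadoop's RecordReader contract (hypothetical).
interface SimpleRecordReader<V> {
    boolean nextKeyValue() throws IOException; // advance to the next record
    V getCurrentValue();                       // payload of the current record
    void close() throws IOException;
}

// One pluggable implementation: each text line is one record. Supporting a
// different HDFS file type would mean supplying a different implementation
// of the same interface, leaving the caller unchanged.
class LineRecordReader implements SimpleRecordReader<String> {
    private final BufferedReader in;
    private String current;

    LineRecordReader(String data) {
        this.in = new BufferedReader(new StringReader(data));
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        current = in.readLine();
        return current != null;
    }

    @Override
    public String getCurrentValue() {
        return current;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        SimpleRecordReader<String> reader =
                new LineRecordReader("1,alpha\n2,beta\n3,gamma");
        int count = 0;
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentValue());
            count++;
        }
        reader.close();
        System.out.println("records=" + count);
    }
}
```

Because the consumer only ever sees the interface, the DMS-side code that drains buffers never needs to know which concrete reader is behind it.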
Figure 10.7 HDFS bridge architecture and data flow
Polybase gets much of its power from its ability to parallelize data
transfer between the compute nodes of PDW and the data nodes of HDFS.
What is interesting is that Polybase achieves this using only information
gathered at query execution time.
When handed a Polybase query, the PDW Engine uses the HDFS Bridge to
speak with the NameNode. The information it receives is used to divide
the work among the compute nodes as evenly as possible. Once the work is
apportioned, 256KB buffers are allocated to stream rows back to the DMS
on each compute node. The DMS continues to ingest these buffers from the
RecordReader interface until the file has been read.
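The division of work described above can be sketched as simple byte-range arithmetic. Everything here is hypothetical for illustration: the class and method names are invented, and PDW's actual split logic is not published in this form; only the 256KB buffer size comes from the text.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanner {
    static final int BUFFER_SIZE = 256 * 1024; // 256KB streaming buffers (from the text)

    // One [start, start+length) byte range assigned to a compute node.
    record Split(long start, long length) {}

    // Divide fileLength bytes across nodeCount nodes as evenly as possible:
    // spread the remainder so no node receives more than one extra byte.
    static List<Split> plan(long fileLength, int nodeCount) {
        List<Split> splits = new ArrayList<>();
        long base = fileLength / nodeCount;
        long remainder = fileLength % nodeCount;
        long offset = 0;
        for (int i = 0; i < nodeCount; i++) {
            long len = base + (i < remainder ? 1 : 0);
            splits.add(new Split(offset, len));
            offset += len;
        }
        return splits;
    }

    // Number of 256KB buffers needed to stream one split back to the DMS.
    static long buffersNeeded(Split s) {
        return (s.length() + BUFFER_SIZE - 1) / BUFFER_SIZE;
    }

    public static void main(String[] args) {
        // Example: a 1GB file divided across 8 compute nodes.
        for (Split s : plan(1L << 30, 8)) {
            System.out.println("start=" + s.start()
                    + " length=" + s.length()
                    + " buffers=" + buffersNeeded(s));
        }
    }
}
```

For a 1GB file and 8 nodes this yields eight equal 128MB ranges, each draining through 512 of the 256KB buffers.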
Imposing Structure with External Tables
PDW uses a DDL concept called an external table to impose the structure
we require on the data held in the files of HDFS. The external table is really