in HDFS. Remember that the world of Hadoop is Java-based, whereas PDW is, somewhat unsurprisingly, C#-based.
The HDFS Bridge therefore uses Java to provide the native integration with
Hadoop. This layer is responsible for communicating with the NameNode
and for identifying the ranges of bytes to read from, or write to, the HDFS
files residing on the data nodes. The next layer up in the HDFS Bridge stack
uses the Java Native Interface (JNI) to expose a managed C# interface to the
rest of the DMS and to the PDW Engine Service.
In Figure 10.7, you can see that the HDFS Bridge uses the Java RecordReader
or RecordWriter interface to access the data in Hadoop. The RecordReader/
RecordWriter is a pluggable element, which is what allows PDW to support
different HDFS file types.
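To make the "pluggable element" idea concrete, the sketch below defines a simplified reader contract and one line-oriented implementation. The interface and class names (`SimpleRecordReader`, `LineRecordReader`) are invented stand-ins that mirror the shape of Hadoop's RecordReader; this is not the real Hadoop API, only an illustration of why swapping the reader swaps the supported file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Simplified stand-in for Hadoop's RecordReader contract (hypothetical).
interface SimpleRecordReader<V> {
    boolean nextKeyValue() throws IOException; // advance to the next record
    V getCurrentValue();                       // payload of the current record
    void close() throws IOException;
}

// One pluggable implementation: each text line is one record. Supporting a
// different HDFS file type would mean supplying a different implementation
// of the same interface, leaving the caller unchanged.
class LineRecordReader implements SimpleRecordReader<String> {
    private final BufferedReader in;
    private String current;

    LineRecordReader(String data) {
        this.in = new BufferedReader(new StringReader(data));
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        current = in.readLine();
        return current != null;
    }

    @Override
    public String getCurrentValue() {
        return current;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        SimpleRecordReader<String> reader =
                new LineRecordReader("1,alpha\n2,beta\n3,gamma");
        int count = 0;
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentValue());
            count++;
        }
        reader.close();
        System.out.println("records=" + count);
    }
}
```

Because the consumer only ever sees the interface, the DMS-side code that drains buffers never needs to know which concrete reader is behind it.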
Figure 10.7 HDFS bridge architecture and data flow
Polybase gets much of its power from its ability to parallelize data
transfer between the compute nodes of PDW and the data nodes of HDFS.
What is interesting is that Polybase achieves this using only information
gathered at query execution time.
When handed a Polybase query, the PDW Engine uses the HDFS Bridge to
speak with the NameNode. The information it receives is used to divide
the work among the compute nodes as evenly as possible. Once the work is
apportioned, 256KB buffers are allocated to stream rows back to the DMS
on each compute node. The DMS continues to ingest these buffers from the
RecordReader interface until the file has been read.
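The division of work described above can be sketched as simple byte-range arithmetic. Everything here is hypothetical for illustration: the class and method names are invented, and PDW's actual split logic is not published in this form; only the 256KB buffer size comes from the text.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanner {
    static final int BUFFER_SIZE = 256 * 1024; // 256KB streaming buffers (from the text)

    // One [start, start+length) byte range assigned to a compute node.
    record Split(long start, long length) {}

    // Divide fileLength bytes across nodeCount nodes as evenly as possible:
    // spread the remainder so no node receives more than one extra byte.
    static List<Split> plan(long fileLength, int nodeCount) {
        List<Split> splits = new ArrayList<>();
        long base = fileLength / nodeCount;
        long remainder = fileLength % nodeCount;
        long offset = 0;
        for (int i = 0; i < nodeCount; i++) {
            long len = base + (i < remainder ? 1 : 0);
            splits.add(new Split(offset, len));
            offset += len;
        }
        return splits;
    }

    // Number of 256KB buffers needed to stream one split back to the DMS.
    static long buffersNeeded(Split s) {
        return (s.length() + BUFFER_SIZE - 1) / BUFFER_SIZE;
    }

    public static void main(String[] args) {
        // Example: a 1GB file divided across 8 compute nodes.
        for (Split s : plan(1L << 30, 8)) {
            System.out.println("start=" + s.start()
                    + " length=" + s.length()
                    + " buffers=" + buffersNeeded(s));
        }
    }
}
```

For a 1GB file and 8 nodes this yields eight equal 128MB ranges, each draining through 512 of the 256KB buffers.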
Imposing Structure with External Tables
PDW uses a DDL concept called an external table to impose the structure
we require on the data held in the files of HDFS. The external table is really