Data Warehouses and Hadoop Integration - Microsoft Big Data Solutions

Figure 10.9 contains two other steps that are interesting:

• Step 11 - ExternalStatisticsOperation

• Step 12 - OnOperation

Although step 11 is “empty” (effectively hiding an internal operation), we can

infer what action the external statistics operation is performing by looking

at step 12. I've copied the code here:

UPDATE STATISTICS

[Instructor].[dbo].[HDFS_FactInternetSales]

WITH ROWCOUNT = [ROWCOUNT_TEMP_ID_246293]

, PAGECOUNT = [PAGECOUNT_TEMP_ID_246293]

Clearly, then, PDW is retrieving what statistical data it can from Hadoop
to determine the row length and the number of rows, in a process known as
file binding. The file blocks are then allocated across the compute nodes as
evenly as possible, which requires knowing the size of the table. This
allocation is called split generation. This step is clearly the first in a
long series of optimizations planned for future phases of Polybase. Knowing
the table size and the row count are important first steps toward cost-based
optimization on Hadoop data.
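
The idea of spreading file blocks "as evenly as possible" can be illustrated with a small sketch. PDW's actual split-generation logic is internal and is not shown here, so this is only one plausible approach under stated assumptions: a greedy strategy that hands each block, largest first, to the compute node holding the least data so far. The function names allocate_blocks and node_loads, the block identifiers, and the block sizes are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def allocate_blocks(block_sizes: Dict[str, int], node_count: int) -> Dict[int, List[str]]:
    """Assign each block to the compute node with the least data so far.

    Illustrative only: a greedy heuristic, not PDW's real algorithm.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")

    # Min-heap of (size_assigned, node_id); the lightest node is always on top.
    heap: List[Tuple[int, int]] = [(0, node) for node in range(node_count)]
    heapq.heapify(heap)
    assignment: Dict[int, List[str]] = {node: [] for node in range(node_count)}

    # Placing the largest blocks first tends to keep the final loads closer together.
    ordered = sorted(block_sizes.items(), key=lambda item: (-item[1], item[0]))
    for block_id, size in ordered:
        load, node = heapq.heappop(heap)
        assignment[node].append(block_id)
        heapq.heappush(heap, (load + size, node))
    return assignment


def node_loads(block_sizes: Dict[str, int], assignment: Dict[int, List[str]]) -> Dict[int, int]:
    """Total size assigned to each node, in the same unit as block_sizes."""
    return {
        node: sum(block_sizes[block_id] for block_id in blocks)
        for node, blocks in assignment.items()
    }


if __name__ == "__main__":
    # Hypothetical file: five 64 MB blocks plus a 16 MB tail block.
    sizes_mb = {f"blk_{i}": 64 for i in range(5)}
    sizes_mb["blk_5"] = 16
    plan = allocate_blocks(sizes_mb, node_count=2)
    for node, total in node_loads(sizes_mb, plan).items():
        print(f"node {node}: {plan[node]} ({total} MB)")
```

The sketch also shows why the table size matters: without the block sizes there is nothing to balance on, and the allocation could only fall back on counting blocks.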

Querying Across Relational and Nonrelational Data

“A single pane of glass,” that's what Polybase offers the business user—the

ability to write a single query that analyzes data across both the relational

data warehouse and the nonrelational data held in Hadoop. In that sense,

Polybase is a uniter of worlds. Another way of looking at it is that Polybase

is like a cow; it has many stomachs to digest data.

By leveraging the existence and structure of the external tables, PDW lets us
simply write queries against data residing in HDFS.

Consider this simple example (see Figure 10.10):

SELECT *

FROM dbo.HDFS_FactInternetSales FIS

OPTION

( LABEL = 'Polybase Read : Q001 : HDFS_FactInternetSales'
