Figure 10.9 contains two other steps that are interesting:
• Step 11 - ExternalStatisticsOperation
• Step 12 - OnOperation
Although step 11 is “empty” (effectively hiding an internal operation), we can
infer what action the external statistics operation is performing by looking
at step 12. I've copied the code here:
UPDATE STATISTICS
[Instructor].[dbo].[HDFS_FactInternetSales]
WITH ROWCOUNT = [ROWCOUNT_TEMP_ID_246293]
, PAGECOUNT = [PAGECOUNT_TEMP_ID_246293]
Clearly, then, PDW is retrieving what statistical data it can from Hadoop
to determine the row length and the number of rows, in a process known as
file binding. The file blocks are then allocated across the compute nodes as
evenly as possible, which requires knowing the size of the table. This is
called split generation. This step is clearly the first in a long series of
optimizations for the future phases of Polybase. Knowing the table size and
the row count are important first steps toward cost-based optimization
on Hadoop data.
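If you want to look at steps such as 11 and 12 for one of your own queries, the PDW dynamic management views are one place to find them. The following is a minimal sketch, assuming the query was submitted with an OPTION (LABEL = ...) clause; the label value shown is a hypothetical placeholder, not one taken from Figure 10.9.

```sql
-- Sketch only: 'My Polybase Query' is a hypothetical label.
-- Find the request by its label, then list its steps in order.
SELECT   RS.step_index
,        RS.operation_type   -- e.g. the operation names shown in the step list
,        RS.location_type
,        RS.row_count
,        RS.command          -- the text of the step, where one is exposed
FROM     sys.dm_pdw_exec_requests ER
JOIN     sys.dm_pdw_request_steps RS
ON       RS.request_id = ER.request_id
WHERE    ER.[label] = 'My Polybase Query'
ORDER BY ER.request_id
,        RS.step_index;
```

For a step that hides an internal operation, the command column may well come back empty, which is why reading the neighboring step is still the practical way to infer what happened.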
Querying Across Relational and Nonrelational Data
“A single pane of glass,” that's what Polybase offers the business user—the
ability to write a single query that analyzes data across both the relational
data warehouse and the nonrelational data held in Hadoop. In that sense,
Polybase is a uniter of worlds. Another way of looking at it is that Polybase
is like a cow; it has many stomachs to digest data.
Because the external tables describe the location and structure of the data,
you can simply write queries in PDW against data residing in HDFS.
Consider this simple example (see Figure 10.10 ) :
SELECT *
FROM dbo.HDFS_FactInternetSales FIS
OPTION
( LABEL = 'Polybase Read : Q001 : HDFS_FactInternetSales'
);
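The same external table can also be joined to an ordinary PDW table, which is where the “single pane of glass” idea shows its value. The query below is a sketch under stated assumptions: dbo.DimCustomer, the CustomerKey and SalesAmount columns, and the Q002 label are all hypothetical (borrowed from the AdventureWorksDW layout) and are not taken from the text.

```sql
-- Sketch only: dbo.DimCustomer, the column names, and the label are
-- assumptions for illustration, not objects confirmed by the text.
SELECT   DC.CustomerKey
,        SUM(FIS.SalesAmount) AS TotalSales
FROM     dbo.DimCustomer            DC    -- relational table stored in PDW (assumed)
JOIN     dbo.HDFS_FactInternetSales FIS   -- external table over data in HDFS
ON       FIS.CustomerKey = DC.CustomerKey
GROUP BY DC.CustomerKey
OPTION
( LABEL = 'Polybase Join : Q002 : DimCustomer and HDFS_FactInternetSales'
);
```

Nothing in the syntax marks FIS as external; the join is written exactly as it would be between two relational tables.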
 