WHERE d.CalendarYear_nmbr = 2012
GROUP BY p.Category_name
, d.CalendarMonth_name
In the preceding code, I am able to query data held in Hadoop and join
it to data held in PDW with a single logical declarative statement. This
is important because it lets consumers work with data irrespective of
its source. To them, everything is just a table of data.
That said, Project Polybase is not without its restrictions. In the
release-to-manufacturing (RTM) version of PDW 2012, Polybase supports
only a delimited file format and works with a limited number of Hadoop
distributions: HDP on Windows, HDP on Linux, and Cloudera.
Furthermore, the RTM version does not leverage any of the compute
resources of the Hadoop cluster. PDW simply imports (at great speed)
all the data held in Hadoop and holds it in temporary tables inside PDW.
This model will evolve over time, and we can look forward to the
automatic generation of MapReduce jobs as a query optimization. One
can imagine a slightly rewritten query, like the following one,
triggering a MapReduce job so that the WHERE clause is applied
inside the Hadoop subtree of the query plan:
SELECT COUNT(*)
, SUM(s.Value) AS Total_Sales
, p.Category_name
, d.CalendarMonth_name
FROM dbo.hdfs_Sales s
JOIN dbo.pdw_Product p ON
s.Product_key = p.Product_key
JOIN dbo.pdw_Date d ON
s.Date_key = d.Date_key
WHERE s.Date_key >= 20120101
AND s.Date_key < 20130101
GROUP BY p.Category_name
, d.CalendarMonth_name
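
To make the idea concrete, here is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for both PDW and Hadoop. It is not Polybase, and it says nothing about how PDW actually plans the query. The table contents are invented, the dbo schema prefix is dropped because SQLite does not use it, and the sketch assumes that Date_key is an integer in yyyymmdd form. Under those assumptions, it shows that filtering on s.Date_key returns the same result as the calendar-year filter on the date dimension, while referring only to a column that lives on the Hadoop side.

```python
import sqlite3

# In-memory SQLite stands in for PDW and Hadoop (assumption, illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hdfs_Sales  (Product_key INTEGER, Date_key INTEGER, Value REAL);
CREATE TABLE pdw_Product (Product_key INTEGER PRIMARY KEY, Category_name TEXT);
CREATE TABLE pdw_Date    (Date_key INTEGER PRIMARY KEY,
                          CalendarYear_nmbr INTEGER,
                          CalendarMonth_name TEXT);

-- Invented sample rows; Date_key is assumed to be yyyymmdd.
INSERT INTO pdw_Product VALUES (1, 'Bikes'), (2, 'Helmets');
INSERT INTO pdw_Date VALUES
    (20111231, 2011, 'December'),
    (20120115, 2012, 'January'),
    (20120220, 2012, 'February'),
    (20130101, 2013, 'January');
INSERT INTO hdfs_Sales VALUES
    (1, 20111231, 100.0),
    (1, 20120115, 250.0),
    (2, 20120115,  40.0),
    (1, 20120220, 300.0),
    (2, 20130101,  55.0);
""")

# Same query shape as in the text; only the WHERE predicate varies.
QUERY = """
SELECT COUNT(*)
,      SUM(s.Value) AS Total_Sales
,      p.Category_name
,      d.CalendarMonth_name
FROM   hdfs_Sales s
JOIN   pdw_Product p ON s.Product_key = p.Product_key
JOIN   pdw_Date d    ON s.Date_key = d.Date_key
WHERE  {predicate}
GROUP BY p.Category_name, d.CalendarMonth_name
ORDER BY p.Category_name, d.CalendarMonth_name
"""

# Filter on the dimension table (held in PDW).
by_year = conn.execute(
    QUERY.format(predicate="d.CalendarYear_nmbr = 2012")
).fetchall()

# Filter on the fact table key (held in Hadoop).
by_key = conn.execute(
    QUERY.format(predicate="s.Date_key >= 20120101 AND s.Date_key < 20130101")
).fetchall()

# Rows a full import would move versus rows a pushed-down filter would move.
rows_total = conn.execute("SELECT COUNT(*) FROM hdfs_Sales").fetchone()[0]
rows_needed = conn.execute(
    "SELECT COUNT(*) FROM hdfs_Sales "
    "WHERE Date_key >= 20120101 AND Date_key < 20130101"
).fetchone()[0]

print(by_year == by_key)
print(rows_needed, "of", rows_total, "sales rows are needed")
```

Because the second predicate touches only the sales table, a future optimizer would be free to evaluate it where the sales data lives and ship only the qualifying rows to PDW. The first predicate cannot be evaluated until the sales rows have been joined to the date dimension.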
Exciting times lie ahead for PDW with Polybase integration into Hadoop.
We will dive into PDW and Polybase in much greater detail in Chapter 10,
“Data Warehouses and Hadoop Integration.”