into three phases. Phase 1 is what we have in PDW today, so I won't belabor
that. Phase 2 was really focused on using the computational resources of
the Hadoop cluster: selectively issuing MapReduce jobs to optimize queries
instead of relying on the parallel import functionality that PDW has today.
Phase 3 was much more heavily caveated, so it's difficult to read too much
into it. However, it was clear that a lot of thought is going into leveraging
Hadoop 2.0, YARN, and Tez. Could we see a more holistic optimizer? Perhaps
we'll see PDW dynamically exporting data to Hadoop when it is more
efficient to process the data there, for example. Who knows? Suddenly
the crystal ball looks a little too blurry.
Let's step back to phase 2, then. While highlighting the opportunity for
leveraging Hadoop's compute resources, Dr. DeWitt also highlighted several
challenges that made me realize this wasn't simply a case of getting
PDW to issue MapReduce jobs every time a Polybase query is received.
In order to decide where and when to use MapReduce, PDW would need
additional information fed into it so it could make an informed decision.
However, put simply, Hadoop did not hold the kind of data required by
PDW, and deriving this information was clearly not going to be a trivial
task. We will discuss the kinds of data PDW needs shortly, but first
let's understand a bit more about how PDW optimizes a query.
Rather than use a rules engine, PDW uses a cost-based optimizer to
determine how to resolve your query in the most efficient manner. Like SQL
Server, we create statistics on key columns to help guide the optimizer. In
PDW, we create stats on all columns involved in the following areas of a
query (see the sketch after this list):
• Joins
• Group by
• Where clause
• Order by
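For illustration, here is a minimal sketch of what those statistics might
look like. The fact table and column names (dbo.FactSales, DateKey,
StoreKey, SalesAmount) are invented for the example:

    -- column used in a join or WHERE predicate
    CREATE STATISTICS stat_FactSales_DateKey
    ON dbo.FactSales (DateKey);

    -- column used in a GROUP BY
    CREATE STATISTICS stat_FactSales_StoreKey
    ON dbo.FactSales (StoreKey);

    -- column used in an ORDER BY
    CREATE STATISTICS stat_FactSales_SalesAmount
    ON dbo.FactSales (SalesAmount);

Each statement builds histogram and density information for its column,
which the cost-based optimizer consults when estimating row counts for
the MPP plan.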
Having these statistical objects created on our tables helps PDW decide on
the best MPP plan for your query. In some cases, especially for composite
joins, we may even create multicolumn stats to help guide the optimizer
further. A multicolumn statistic provides the density of the neighboring
column(s) to the optimizer, which can help to influence the type of join
chosen. Instead of a nested loop join, which is often undesirable and
inefficient in a data warehouse, the extra density information might be
enough to steer the optimizer toward a more suitable strategy, such as a
hash join.
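As a sketch, again using the hypothetical dbo.FactSales table, a
multicolumn statistic covering a composite join key might look like this:

    -- multicolumn statistic on a composite join key
    CREATE STATISTICS stat_FactSales_CustomerProduct
    ON dbo.FactSales (CustomerKey, ProductKey);

Column order matters here: density information is kept for each left-based
prefix (CustomerKey alone, then CustomerKey plus ProductKey), so list the
columns in the order the join predicates reference them.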