into three phases. Phase 1 is what we have in PDW today, so I won't belabor
that. Phase 2 was really focused on using the computational resources of
the Hadoop cluster: selectively issuing MapReduce jobs to optimize queries
instead of relying on the parallel import functionality that PDW has today.
Phase 3 was much more heavily caveated, so it's difficult to read too much
into it. However, it was clear that a lot of thought is going into leveraging
Hadoop 2.0, YARN, and Tez. Could we see a more holistic optimizer? Perhaps
we'll see PDW dynamically exporting data to Hadoop when it is more
efficient to process the data there, for example. Who knows? Suddenly
the crystal ball looks a little too blurry.
Let's step back to phase 2, then. While highlighting the opportunity for
leveraging Hadoop's compute resources, Dr. DeWitt also highlighted several
challenges that made me realize this wasn't simply a case of getting
PDW to issue MapReduce jobs every time a Polybase query is received.
In order to decide where and when to use MapReduce, PDW would need
additional information fed into it so it could make an informed decision.
However, put simply, Hadoop did not hold the kind of data required by
PDW, and deriving this information was clearly not going to be a trivial
task. We will discuss the kinds of data PDW needs shortly, but first
let's understand a bit more about how PDW optimizes a query.
Rather than use a rules engine, PDW uses a cost-based optimizer to
determine how to resolve your query in the most efficient manner. Like SQL
Server, we create statistics on key columns to help guide the optimizer. In
PDW, we create stats on all columns involved in the following areas of a
query (see the sketch after this list):
• Joins
• Group by
• Where clause
• Order by
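For illustration, here is a minimal sketch of what those statistics might
look like. The fact table and column names (dbo.FactSales, DateKey,
StoreKey, SalesAmount) are invented for the example:

    -- column used in a join or WHERE predicate
    CREATE STATISTICS stat_FactSales_DateKey
    ON dbo.FactSales (DateKey);

    -- column used in a GROUP BY
    CREATE STATISTICS stat_FactSales_StoreKey
    ON dbo.FactSales (StoreKey);

    -- column used in an ORDER BY
    CREATE STATISTICS stat_FactSales_SalesAmount
    ON dbo.FactSales (SalesAmount);

Each statement builds histogram and density information for its column,
which the cost-based optimizer consults when estimating row counts for
the MPP plan.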
Having these statistical objects created on our tables helps PDW decide on
the best MPP plan for your query. In some cases, especially for composite
joins, we may even create multicolumn stats to help guide the optimizer
further. A multicolumn statistic provides the density of the neighboring
column(s) to the optimizer, which can help to influence the type of join
chosen. Instead of a nested loop join, which is often undesirable and
inefficient in a data warehouse, the extra density information might be
enough to steer the optimizer toward a more suitable strategy, such as a
hash join.
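As a sketch, again using the hypothetical dbo.FactSales table, a
multicolumn statistic covering a composite join key might look like this:

    -- multicolumn statistic on a composite join key
    CREATE STATISTICS stat_FactSales_CustomerProduct
    ON dbo.FactSales (CustomerKey, ProductKey);

Column order matters here: density information is kept for each left-based
prefix (CustomerKey alone, then CustomerKey plus ProductKey), so list the
columns in the order the join predicates reference them.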