Data Warehouses and Hadoop Integration - Microsoft Big Data Solutions


to choose a hash join, for example. All these techniques are great tuning options for data in your data warehouse.

Hadoop and HDFS do not even hold this basic information. Hadoop's approach to data analysis is more of a sledgehammer. Remember, the Hadoop mindset is to throw compute at the problem.

However, with a system such as Polybase, you really need the statistical information, and a whole lot more. Therefore, in addition to holding statistical information about data held in Hadoop, the Polybase engineers need access to additional information to determine the optimal plan. Consider this list for starters:

• Hadoop cluster size

• Network bandwidth to Hadoop cluster

• Utilization of resources on Hadoop

• Proximity of Hadoop cluster

• Selectivity of predicates for data held in Hadoop

• Semantic differences between Java and SQL

As Polybase evolves and both Hadoop and PDW mature, you could see some really interesting decisions being made. Is 90% of the data in Hadoop, for example? One answer could be to send the remaining data over to Hadoop and process the query there. Is Hadoop busy and data volume reasonable? Move the data to PDW. I am no query processor expert (far from it), but these kinds of possibilities are exciting!
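To make those two questions concrete, here is a toy sketch in Python of the kind of placement decision described above. It is not how Polybase works internally. The class, the function name, and every threshold (80% utilization, 90% data share, a 10 GB import limit) are assumptions invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryContext:
    """Hypothetical optimizer inputs; none of these names come from Polybase."""
    hadoop_bytes: int          # data the query touches in Hadoop
    pdw_bytes: int             # data the query touches in PDW
    selectivity: float         # fraction of Hadoop data passing the predicates (0..1)
    hadoop_utilization: float  # fraction of Hadoop resources currently busy (0..1)


def choose_location(ctx: QueryContext,
                    busy_threshold: float = 0.8,
                    hadoop_share_threshold: float = 0.9,
                    max_import_bytes: int = 10 * 1024**3) -> str:
    """Return "HADOOP" or "PDW" using two illustrative rules of thumb."""
    total = ctx.hadoop_bytes + ctx.pdw_bytes
    if total == 0:
        return "PDW"  # nothing to move, so stay where the query was issued

    # Estimate how much Hadoop data survives the predicates.
    filtered_bytes = int(ctx.hadoop_bytes * ctx.selectivity)
    hadoop_busy = ctx.hadoop_utilization >= busy_threshold

    # Rule 1: Hadoop is busy and the volume is reasonable, so move the data to PDW.
    if hadoop_busy and filtered_bytes <= max_import_bytes:
        return "PDW"

    # Rule 2: most of the data already lives in Hadoop, so ship the rest there.
    if ctx.hadoop_bytes / total >= hadoop_share_threshold:
        return "HADOOP"

    # Otherwise fall back to importing into PDW.
    return "PDW"


if __name__ == "__main__":
    quiet = QueryContext(hadoop_bytes=950, pdw_bytes=50,
                         selectivity=1.0, hadoop_utilization=0.2)
    print(choose_location(quiet))
```

A real optimizer would weigh cluster size, network bandwidth, and proximity as costs rather than apply fixed cut-offs; the sketch only shows how a couple of the factors in the list could feed a decision.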

Why Poly in Polybase?

If the sole objective of Polybase were to integrate with just HDFS, why call it Polybase? If we look at the definition of poly, which is “more than one; many or much” (according to the Collins Dictionary), is this not a clear signal of bigger things to come? Speaking personally for a moment, I'd love to be able to simply reference delimited files exposed on a Windows NTFS file system. I think that would be a really nice extension of this feature and would significantly strengthen PDW's data export functionality. However, I am sure that the cloud will factor into Polybase somewhere. Every product in the Microsoft Data Platform has to have a cloud strategy, and PDW is no
