to choose a hash join, for example. All these techniques are great tuning
options for data in your data warehouse.
Hadoop and HDFS do not even hold this basic information. Hadoop's approach to
data analysis is more sledgehammer-based. Remember, the Hadoop mindset
is to throw compute at the problem.
However, with a system such as Polybase, you really need the statistical
information, and a whole lot more. Therefore, in addition to holding
statistical information about data held in Hadoop, the Polybase engineers
need access to additional information to determine the optimal plan.
Consider this list for starters:
• Hadoop cluster size
• Network bandwidth to Hadoop cluster
• Utilization of resources on Hadoop
• Proximity of Hadoop cluster
• Selectivity of predicates for data held in Hadoop
• Semantic differences between Java and SQL
As Polybase evolves and both Hadoop and PDW mature, you could see some
really interesting decisions being made. Is 90% of the data in Hadoop, for
example? One answer could be to send the remaining data over to Hadoop
and process the query there. Is Hadoop busy and data volume reasonable?
Move the data to PDW. I am no query processor expert (far from it), but
these kinds of possibilities are exciting!
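To make those two rules of thumb concrete, here is a minimal sketch in Python of how such a placement decision might be expressed. It is purely illustrative and is not how Polybase or PDW actually decides anything: the QueryContext type, the choose_processing_site function, and every threshold (90% share, 80% utilization, 10 GB movable volume) are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class QueryContext:
    """Hypothetical inputs; none of these names come from Polybase itself."""
    hadoop_bytes: int          # data volume the query touches in Hadoop
    pdw_bytes: int             # data volume the query touches in PDW
    hadoop_utilization: float  # 0.0 (idle) to 1.0 (saturated)


def choose_processing_site(ctx, majority=0.9, busy=0.8,
                           movable_bytes=10 * 1024**3):
    """Return "hadoop" or "pdw" using the two rules of thumb from the text.

    All thresholds are assumed values chosen for illustration only.
    """
    total = ctx.hadoop_bytes + ctx.pdw_bytes
    hadoop_share = ctx.hadoop_bytes / total if total else 0.0

    # Hadoop is busy and its share of the data is a reasonable size:
    # move that data to PDW and process the query there.
    if ctx.hadoop_utilization >= busy and ctx.hadoop_bytes <= movable_bytes:
        return "pdw"

    # Most of the data already lives in Hadoop: send the remainder over
    # and process the query in Hadoop.
    if hadoop_share >= majority:
        return "hadoop"

    # Default for this sketch: keep processing in PDW.
    return "pdw"
```

A real optimizer would weigh the whole list of factors above (cluster size, bandwidth, proximity, predicate selectivity) in a cost model rather than apply fixed thresholds; the sketch only shows the shape of the decision.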
Why Poly in Polybase?
If the sole objective of Polybase were to integrate with just HDFS, why call
it Polybase? If we look at the definition of poly, which is “more than one;
many or much” (according to the Collins Dictionary), is this not a clear
signal of bigger things to come? Speaking personally for a moment, I'd love
to be able to simply reference delimited files exposed on a Windows NTFS
file system. I think that would be a really nice extension of this feature and
would significantly strengthen PDW's data export functionality. However, I
am sure that the cloud will factor into Polybase somewhere. Every product
in the Microsoft Data Platform has to have a cloud strategy, and PDW is no