GROUP BY token
;
The results should be the same as the output from “Example 4: Replicated Joins” in
Cascading. As you can see from this example, the expression of the last query is relatively
compact and easy to understand. Getting input data into Hive required a few backflips.
We didn't show how to get data out, but the output is essentially an HDFS file, and
you'll need to manage your ETL process outside of Hive.
There are several advantages to using Hive:
• Hive is the most popular abstraction atop Apache Hadoop.
• Hive has a SQL-like language where the syntax is familiar for most analysts.
• Hive makes it simple to load large-scale unstructured data and run ad hoc queries.
• Hive provides many built-in functions for statistics, JSON, XPath, etc.
Hive is easy to understand on the surface, given that SQL is the lingua franca of Enterprise
data. However, a typical concern voiced in Enterprise IT environments is that while
Hive provides a SQL-like syntax, it is not compliant with the ANSI SQL spec. Hive's
behaviors can contradict what people expect from SQL and relational databases. For example,
nondeterministic execution of queries, particularly when Hive attempts to use
different join strategies, implies big surprises at scale during runtime.
Many years ago, when Enterprise firms were considering SQL databases as a new technology,
the predictability of runtime costs was a factor driving adoption. Although Hive's
HQL looks familiar to SQL users, Hive does not offer that predictability of runtime costs.
Other issues found with Pig also apply to Hive:
• Integration generally requires code outside the scripting language.
• Business logic must cross multiple language boundaries.
• It becomes difficult to represent complex workflows, machine learning algorithms,
etc.
Again, all of this makes it increasingly difficult to troubleshoot, optimize, and audit
Hive applications, and to handle exceptions, set notifications, track data provenance,
etc., for Enterprise data workflows. Each “bug” may require hours or even days before
its context can be reproduced in a test environment. Complexity of the software grows,
and so does the associated risk.