Hive
Apache Hive is another key subproject of Hadoop. It provides data
warehouse software that enables a SQL-like querying experience for the
end user. The Hive query language is called Hive Query Language (HQL).
(Clearly, the creators of Hive had no time for any kind of creative branding.)
HQL is similar to ANSI SQL, making the crossover from one to the other
relatively simple. HQL provides an abstraction over MapReduce; HQL
queries are translated by Hive into MapReduce jobs. Hive is therefore quite
a popular starting point for end users because there is no need to learn how
to program a MapReduce job to access and process data held in Hadoop.
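
To make the translation concrete, the following Python sketch imitates the map, shuffle, and reduce phases that a simple aggregate query might be compiled into. The table name page_views, its columns, and the sample rows are hypothetical, and the code only illustrates the idea; it is not how Hive actually generates or runs its jobs.

```python
from collections import defaultdict

# Hypothetical HQL query this sketch imitates:
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;

# Made-up sample rows standing in for the page_views table.
PAGE_VIEWS = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "UK"},
    {"user": "c", "country": "US"},
]


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Collect emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows):
    """Chain the three phases together for the sample query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


result = count_by_country(PAGE_VIEWS)
print(result)
```

The end user writes only the one-line query in the comment; the three phases are the kind of work that would otherwise have to be coded by hand.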
It is important to understand that Hive does not turn Hadoop into a
relational database management system (RDBMS). Hive is still a
batch-processing system that generates MapReduce jobs. It does not offer
transactional support, a full type system, security, high concurrency, or
predictable response times. Queries tend to be measured in minutes rather
than in milliseconds or seconds. This is because there is a high spin-up cost
for each query and no cost-based optimizer underpins the query plan, as
traditional SQL developers are used to. Therefore, it is important not to
overstate Hive's capabilities.
Hive does offer certain features that an RDBMS might not, though. For
example, Hive supports the following complex types: structs, maps (key/
value pairs), and arrays. Likewise, Hive offers native operator support for
regular expressions, which is an interesting addition. HQL also offers
additional extensibility by allowing MapReduce developers to plug in their
own custom mappers and reducers, allowing for more advanced analysis.
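
As a rough analogy for the complex types and regular expression support described above, the Python sketch below models one row with an array, a map, and a struct column and then applies a regular expression filter. The column names and values are hypothetical, and the HQL fragments in the comments are assumptions shown for comparison only. Hive evaluates regular expressions with Java's regex engine, so Python's re module is only an approximation.

```python
import re

# Hypothetical row from a table with complex-typed columns:
#   tags     ARRAY<STRING>
#   props    MAP<STRING, STRING>
#   address  STRUCT<city: STRING, zip: STRING>
row = {
    "name": "alice",
    "tags": ["admin", "beta"],
    "props": {"theme": "dark"},
    "address": {"city": "Leeds", "zip": "LS1"},
}

first_tag = row["tags"][0]      # comparable to tags[0] in HQL
theme = row["props"]["theme"]   # comparable to props['theme'] in HQL
city = row["address"]["city"]   # comparable to address.city in HQL

# Comparable to: WHERE name RLIKE '^a.*e$'
matches = re.search(r"^a.*e$", row["name"]) is not None

print(first_tag, theme, city, matches)
```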
The most recent and exciting development for Hive has been the new
Stinger initiative. Stinger has the goal of delivering a 100X performance
improvement to Hive plus SQL compatibility. These two features will have a
profound impact on Hadoop adoption; keep them on your radar. We'll talk
more about Stinger in Chapter 2, “Microsoft's Approach to Big Data.”
Pig
Apache Pig is an openly extensible programmable platform for loading,
manipulating, and transforming data in Hadoop using a scripting language
called Pig Latin. Pig is another abstraction on top of the Hadoop core.
It converts the Pig Latin script into MapReduce jobs, which can then be
executed against Hadoop.
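
The load, manipulate, and transform flow of a Pig Latin script can be approximated in plain Python. The script in the comment, the logs relation, and the sample records are all hypothetical; this is a sketch of the step-by-step data flow style, not of how Pig compiles a script into MapReduce jobs.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Pig Latin script this sketch imitates:
#   logs   = LOAD 'logs' AS (user:chararray, bytes:int);
#   big    = FILTER logs BY bytes > 100;
#   byuser = GROUP big BY user;
#   totals = FOREACH byuser GENERATE group, SUM(big.bytes);

# Made-up (user, bytes) records standing in for the loaded relation.
logs = [("ann", 50), ("bob", 200), ("ann", 300), ("bob", 150)]

# FILTER: keep only records with more than 100 bytes.
big = [record for record in logs if record[1] > 100]

# GROUP: groupby needs its input sorted by the grouping key.
by_user = groupby(sorted(big, key=itemgetter(0)), key=itemgetter(0))

# FOREACH ... GENERATE: one (user, total bytes) pair per group.
totals = [(user, sum(size for _, size in records)) for user, records in by_user]

print(totals)
```

Each statement names an intermediate relation, which is why Pig Latin tends to read as a pipeline of transformations rather than as a single declarative query.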