Industry Needs and Solutions - Microsoft Big Data Solutions

Hive

Apache Hive is another key subproject of Hadoop. It provides data

warehouse software that enables a SQL-like querying experience for the

end user. The Hive query language is called Hive Query Language (HQL).

(Clearly, the creators of Hive had no time for any kind of creative branding.)

HQL is similar to ANSI SQL, making the crossover from one to the other

relatively simple. HQL provides an abstraction over MapReduce; HQL

queries are translated by Hive into MapReduce jobs. Hive is therefore quite

a popular starting point for end users because there is no need to learn how

to program a MapReduce job to access and process data held in Hadoop.
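
To make that translation concrete, consider a query such as SELECT country, COUNT(*) FROM page_views GROUP BY country. The Python sketch below imitates, inside a single process, the map, shuffle, and reduce phases that a query of this shape broadly corresponds to. The table name, the columns, and the sample rows are all hypothetical, and the code only illustrates the idea; it is not the job that Hive itself would generate.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table called page_views (user, country).
ROWS = [
    ("alice", "US"),
    ("bob", "UK"),
    ("carol", "US"),
    ("dave", "DE"),
    ("erin", "UK"),
    ("frank", "US"),
]


def map_phase(rows):
    """Emit one (country, 1) pair per row, as a mapper would."""
    for _user, country in rows:
        yield country, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows):
    """Chain the three phases to answer the GROUP BY / COUNT(*) question."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_country(ROWS))
```

The point of the sketch is the amount of plumbing involved: the HQL user writes one statement, while the equivalent hand-written job needs a mapper, a reducer, and the grouping step in between.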

It is important to understand that Hive does not turn Hadoop into a

relational database management system (RDBMS). Hive is still a

batch-processing system that generates MapReduce jobs. It does not offer

transactional support, a full type system, security, high concurrency, or

predictable response times. Queries tend to be measured in minutes rather

than milliseconds or seconds. This is because there is a high spin-up cost

for each query and, unlike the systems traditional SQL developers are used to,

no cost-based optimizer underpins the query plan. Therefore, it is

important not to overstate Hive's capabilities.

Hive does offer certain features that an RDBMS might not, though. For

example, Hive supports the following complex types: structs, maps (key/

value pairs), and arrays. Likewise, Hive offers native operator support for

regular expressions, which is an interesting addition. HQL also offers

additional extensibility by allowing MapReduce developers to plug in their

own custom mappers and reducers, allowing for more advanced analysis.
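
As a rough illustration of the custom mapper idea, the following Python sketch shows the kind of script that could be plugged into a query through Hive's TRANSFORM ... USING clause, which passes rows to an external program as tab-separated lines and reads tab-separated lines back. The column layout (user_id, url), the bot-filtering pattern, and the sample rows are assumptions made up for this example. A real script would read sys.stdin instead of the in-memory sample used here.

```python
import io
import re
import sys

# Hypothetical rule: treat any user_id containing these words as automated traffic.
BOT_PATTERN = re.compile(r"(bot|crawler|spider)", re.IGNORECASE)

# Captures the host part of a URL such as http://example.com/path.
HOST_PATTERN = re.compile(r"^[a-z]+://([^/]+)")

# Hypothetical sample input: one row per line, columns separated by tabs.
SAMPLE = "alice\thttp://example.com/a/b\nGoogleBot\thttp://example.com/\n"


def transform_line(line):
    """Return one output row for one input row, or None to drop the row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None  # skip malformed rows
    user_id, url = fields
    if BOT_PATTERN.search(user_id):
        return None  # drop rows that look like automated traffic
    match = HOST_PATTERN.match(url)
    host = match.group(1) if match else "unknown"
    return f"{user_id}\t{host}"


def run(stream_in, stream_out):
    """Apply transform_line to every input line and write the surviving rows."""
    for line in stream_in:
        row = transform_line(line)
        if row is not None:
            stream_out.write(row + "\n")


if __name__ == "__main__":
    # Swap io.StringIO(SAMPLE) for sys.stdin to use this as a streaming script.
    run(io.StringIO(SAMPLE), sys.stdout)
```

Keeping the per-row logic in transform_line, separate from the stream handling, makes the script easy to test locally before it is attached to a query.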

The most recent and exciting developments for Hive have been the new

Stinger initiatives. Stinger aims to deliver a 100x performance

improvement to Hive plus closer SQL compatibility. These two improvements will have a

profound impact on Hadoop adoption; keep them on your radar. We'll talk

more about Stinger in Chapter 2, “Microsoft's Approach to Big Data.”

Pig

Apache Pig is an extensible programming platform for loading,

manipulating, and transforming data in Hadoop using a scripting language

called Pig Latin. Pig is another abstraction on top of the Hadoop core.

It converts the Pig Latin script into MapReduce jobs, which can then be

executed against Hadoop.
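
Where HQL describes the result that is wanted, a Pig Latin script spells out the dataflow one named step at a time. The Python sketch below mimics that style with load, filter, group, and aggregate steps, each producing a new intermediate dataset. The input data, the field names, and the 40-second threshold are hypothetical, and the functions are ordinary Python stand-ins rather than Pig operators.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical input standing in for a delimited file stored in Hadoop:
# user, country, visit length in seconds.
RAW = """alice,US,120
bob,UK,45
carol,US,300
dave,DE,15
erin,UK,80
"""


def load(text):
    """Load step: parse delimited text into (user, country, seconds) tuples."""
    return [(u, c, int(s)) for u, c, s in csv.reader(io.StringIO(text))]


def filter_rows(rows, min_seconds):
    """Filter step: keep visits that lasted at least min_seconds."""
    return [row for row in rows if row[2] >= min_seconds]


def group_by_country(rows):
    """Group step: collect rows under their country key."""
    ordered = sorted(rows, key=itemgetter(1))  # groupby needs sorted input
    return {key: list(group) for key, group in groupby(ordered, key=itemgetter(1))}


def total_seconds(groups):
    """Aggregate step: one (country, total seconds) tuple per group."""
    return sorted((country, sum(r[2] for r in rows)) for country, rows in groups.items())


# Each assignment names an intermediate dataset, much as a dataflow script would.
visits = load(RAW)
long_visits = filter_rows(visits, 40)
by_country = group_by_country(long_visits)
totals = total_seconds(by_country)

if __name__ == "__main__":
    print(totals)
```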
