Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets - Data Just Right: Introduction to Large-Scale Data and Analytics

Database Reference

In-Depth Information

In order to apply the ability to run SQL-like queries using Hadoop's MapReduce

framework, Facebook needed to build software that could both manage database-like

structures and translate queries into multistage MapReduce jobs. Facebook's solution

was to create a data warehousing tool known as Hive. On the surface, Hive looks a

bit like typical data warehousing solutions based on relational databases. However, it

provides a number of advantages that especially apply to the challenges that Facebook

was encountering. First of all, Hive is able to use the underlying Hadoop framework

to more or less scale indefinitely as data sizes grow. Hive is also extensible; because it is

based on Hadoop, you can simply write new user-defined functions (or UDFs) using

the same MapReduce framework used by Hive.

Use Cases for Hive

As happens time and time again, the use cases for Apache Hive overlap those appropri-

ate for other technology solutions. When is it best to use Hive? For some users, Hive

can be an inexpensive and f lexible alternative to commercial data warehousing solu-

tions. Depending on the use case, Hive can make it possible to skip building compli-

cated ETL pipelines for data processing, which simplifies data analysis tremendously.

Thanks to the underlying Hadoop framework, Hive is also able to scale well as data

sizes grow large.

Despite the ability to scale across many machines and the presence of an SQL-like

query language, Hive is not meant to be used as the database backend of a high-traffic

system (sometimes known as an “operational” data store). MapReduce is a powerful

processing concept, but it's designed to provide a f lexible and programmable interface

to batch processing rather than raw speed. In many cases, Hive queries can be “fast

enough,” returning results over large datasets on the order of minutes. For already

structured data, Hive can return results fast enough to enable users to skip the ETL

steps necessary to build the star schemas for traditional data warehouses. If the solu-

tion to your data challenge requires even faster query results, it might make sense to

consider investing in the overhead of more traditional data warehousing software or to

add an analytical database to your data processing pipeline (see Chapter 6, “Building a

Data Dashboard with Google BigQuery”).

There's another reason why you might choose to go with the traditional data ware-

house over Hive: to take advantage of the robust sets of features available from the

more mature commercial market. For example, commercial data warehouses are often

built on top of hardware that provides features such as data disaster recovery. Hive

provides a much simpler set of features, and tasks such as data replication might require

more manual work.

At its core, Hive is neither a database nor a traditional data warehouse (although

it combines some features from both). Hive is really a tool for making some of the

advantages of the MapReduce framework available to address challenges that normally

Search WWH ::

Custom Search

Home