Apache Hive: Interactive Querying for Hadoop
To run SQL-like queries on Hadoop's MapReduce framework, Facebook needed to build
software that could both manage database-like structures and translate queries
into multistage MapReduce jobs. Facebook's solution
was to create a data warehousing tool known as Hive. On the surface, Hive looks a
bit like typical data warehousing solutions based on relational databases. However, it
offers several advantages that apply especially well to the challenges Facebook
was encountering. First of all, Hive can use the underlying Hadoop framework
to scale more or less indefinitely as data sizes grow. Hive is also extensible; because it is
based on Hadoop, you can write new user-defined functions (UDFs) using
the same MapReduce framework used by Hive.
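
To make the idea of translating a query into MapReduce stages more concrete, here is a minimal sketch in plain Python. It is not Hive's actual query planner, and it does not use Hadoop at all; it only imitates, in memory, the map, shuffle, and reduce steps that a simple aggregate query might be broken into. The table name page_views, its columns, and the sample rows are all hypothetical and exist purely for illustration.

```python
from collections import defaultdict

# Hypothetical query being imitated (names are illustrative only):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;

# Hypothetical rows from a page_views table: (user_id, page)
PAGE_VIEWS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/about"),
    ("u3", "/home"),
    ("u2", "/about"),
    ("u3", "/contact"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Collect mapped values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate, here COUNT(*), to each group of values."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three steps to answer the GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for page, views in sorted(run_query(PAGE_VIEWS).items()):
        print(page, views)
```

A single GROUP BY fits into one map and reduce pass in this sketch. A query that does more work, such as sorting the aggregated result or joining it with another table, would generally need further passes, which is where the "multistage" part of the translation comes in.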
Use Cases for Hive
As happens time and time again, the use cases for Apache Hive overlap those
appropriate for other technology solutions. When is it best to use Hive? For some users,
Hive can be an inexpensive and flexible alternative to commercial data warehousing
solutions. Depending on the use case, Hive can make it possible to skip building
complicated ETL pipelines for data processing, which simplifies data analysis tremendously.
Thanks to the underlying Hadoop framework, Hive is also able to scale well as data
sizes grow large.
Despite the ability to scale across many machines and the presence of an SQL-like
query language, Hive is not meant to be used as the database backend of a high-traffic
system (sometimes known as an “operational” data store). MapReduce is a powerful
processing concept, but it's designed to provide a flexible and programmable interface
to batch processing rather than raw speed. In many cases, Hive queries can be “fast
enough,” returning results over large datasets on the order of minutes. For already
structured data, Hive can return results fast enough to enable users to skip the ETL
steps necessary to build the star schemas for traditional data warehouses. If the
solution to your data challenge requires even faster query results, it might make sense to
consider investing in the overhead of more traditional data warehousing software or to
add an analytical database to your data processing pipeline (see Chapter 6, “Building a
Data Dashboard with Google BigQuery”).
There's another reason why you might choose a traditional data warehouse
over Hive: to take advantage of the robust feature sets available from the
more mature commercial market. For example, commercial data warehouses are often
built on top of hardware that provides features such as data disaster recovery. Hive
provides a much simpler set of features, and tasks such as data replication might require
more manual work.
At its core, Hive is neither a database nor a traditional data warehouse (although
it combines some features from both). Hive is really a tool for making some of the
advantages of the MapReduce framework available to address challenges that normally
 
 