Chapter 17. Hive
In “Information Platforms and the Rise of the Data Scientist,”[106] Jeff Hammerbacher describes Information Platforms as “the locus of their organization's efforts to ingest, process, and generate information,” and how they “serve to accelerate the process of learning from empirical data.”
One of the biggest ingredients in the Information Platform built by Jeff's team at Facebook was Apache Hive, a framework for data warehousing on top of Hadoop. Hive grew from a need to manage and learn from the huge volumes of data that Facebook was producing every day from its burgeoning social network. After trying a few different systems, the team chose Hadoop for storage and processing, since it was cost-effective and met the scalability requirements.
Hive was created to make it possible for analysts with strong SQL skills (but meager Java
programming skills) to run queries on the huge volumes of data that Facebook stored in
HDFS. Today, Hive is a successful Apache project used by many organizations as a
general-purpose, scalable data processing platform.
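To give a first taste of what this looks like in practice, here is a minimal sketch of a Hive session. The table name and columns (`records`, `year`, `temperature`) are purely illustrative, not drawn from any particular dataset; the point is that the syntax is ordinary SQL, while Hive executes the query as jobs over files stored in HDFS:

```sql
-- Hypothetical table of tab-delimited (year, temperature) readings in HDFS.
CREATE TABLE records (year STRING, temperature INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A familiar SQL aggregation; Hive turns it into distributed processing
-- over the underlying HDFS files.
SELECT year, MAX(temperature)
FROM records
GROUP BY year;
```

An analyst who knows SQL can write this without touching Java or the underlying execution engine, which was precisely the gap Hive was built to fill.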
Of course, SQL isn't ideal for every big data problem — it's not a good fit for building
complex machine-learning algorithms, for example — but it's great for many analyses, and
it has the huge advantage of being very well known in the industry. What's more, SQL is
the lingua franca in business intelligence tools (ODBC is a common bridge, for example),
so Hive is well placed to integrate with these products.
This chapter is an introduction to using Hive. It assumes that you have working knowledge
of SQL and general database architecture; as we go through Hive's features, we'll often
compare them to the equivalent in a traditional RDBMS.