Chapter 17. Hive
In “Information Platforms and the Rise of the Data Scientist,”[106] Jeff Hammerbacher describes Information Platforms as “the locus of their organization's efforts to ingest, process, and generate information,” and how they “serve to accelerate the process of learning from empirical data.”
One of the biggest ingredients in the Information Platform built by Jeff's team at Facebook was Apache Hive, a framework for data warehousing on top of Hadoop. Hive grew from a need to manage and learn from the huge volumes of data that Facebook was producing every day from its burgeoning social network. After trying a few different systems, the team chose Hadoop for storage and processing, since it was cost-effective and met the scalability requirements.
Hive was created to make it possible for analysts with strong SQL skills (but meager Java
programming skills) to run queries on the huge volumes of data that Facebook stored in
HDFS. Today, Hive is a successful Apache project used by many organizations as a
general-purpose, scalable data processing platform.
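To give a first taste of what this looks like in practice, here is a minimal sketch of a Hive session. The table name and columns (`records`, `year`, `temperature`) are purely illustrative, not drawn from any particular dataset; the point is that the syntax is ordinary SQL, while Hive executes the query as jobs over files stored in HDFS:

```sql
-- Hypothetical table of tab-delimited (year, temperature) readings in HDFS.
CREATE TABLE records (year STRING, temperature INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A familiar SQL aggregation; Hive turns it into distributed processing
-- over the underlying HDFS files.
SELECT year, MAX(temperature)
FROM records
GROUP BY year;
```

An analyst who knows SQL can write this without touching Java or the underlying execution engine, which was precisely the gap Hive was built to fill.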
Of course, SQL isn't ideal for every big data problem — it's not a good fit for building
complex machine-learning algorithms, for example — but it's great for many analyses, and
it has the huge advantage of being very well known in the industry. What's more, SQL is
the lingua franca in business intelligence tools (ODBC is a common bridge, for example),
so Hive is well placed to integrate with these products.
This chapter is an introduction to using Hive. It assumes that you have working knowledge
of SQL and general database architecture; as we go through Hive's features, we'll often
compare them to the equivalent in a traditional RDBMS.