could only be handled by complex ETL and data warehouse tools. Hive can be a very
flexible choice for aggregating organizational data, providing relatively low cost at the
expense of raw speed. This makes Hive the right choice for integration with existing
Hadoop installations. When data grows to extreme levels, such as the petabyte scales
experienced by Facebook, then distributed solutions such as Hive might be the only
viable way to provide economically feasible data warehouse functionality.
Hive in Practice
Although Hive's query language is meant to be familiar to users of relational databases,
there are many differences that stem from Hive's underlying Hadoop infrastructure. In
order to understand how Hive works, let's take a quick look at some of the basic concepts behind Hadoop and MapReduce.
The Hadoop ecosystem contains a soup of terminology that can be confusing to
beginners. Hadoop itself is a framework for distributing data processing jobs across
many machines. Hadoop's MapReduce model for processing data is a three-step process. In the first step, called the map phase, data is split into many shards, each of which is identified by a particular key. The next phase, the shuffle sort, aggregates data shards containing the same key on the same node in the cluster, allowing for data processing to take place close to the actual data. Finally, in the reduce phase, the shuffled data from individual nodes is crunched on local machines and output to produce a final result.
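As a rough sketch of these three phases, the following plain-Python example (not actual Hadoop code; the sample documents are invented) counts words the way a MapReduce job would: the map phase emits (word, 1) pairs, the shuffle sort groups pairs by key, and the reduce phase sums each group.

from collections import defaultdict

def map_phase(documents):
    # Map: split each document into (word, 1) key/value pairs.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_sort(pairs):
    # Shuffle sort: gather all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each group of values into a final count.
    return {word: sum(values) for word, values in groups.items()}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle_sort(map_phase(documents))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

In a real Hadoop job, the map and reduce functions run on different machines and the framework performs the shuffle sort over the network, but the division of labor is the same.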
The MapReduce framework is a simple concept overall, but applying this processing model to complex tasks can be tricky. A task such as counting individual words across terabytes of files might require just a single MapReduce job. However, a complex aggregate query over two tables, involving mathematical operations and a join between two different types of data, may require multiple MapReduce jobs chained together.
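For example, a join followed by an aggregation typically compiles into more than one job. The sketch below (again plain Python with hypothetical tables and columns, not Hive or Hadoop code) chains two rounds: the first performs a reduce-side join of orders and customers on customer_id, and the second groups the joined records by country and sums the amounts.

from collections import defaultdict

# Hypothetical input tables; in Hive these would be files in HDFS.
orders = [("c1", 120.0), ("c2", 80.0), ("c1", 40.0)]   # (customer_id, amount)
customers = [("c1", "US"), ("c2", "DE")]               # (customer_id, country)

# Round 1 -- reduce-side join: the map tags each record with its source table,
# the shuffle groups records by customer_id, and the reduce pairs each order
# amount with that customer's country.
tagged = [(cid, ("order", amount)) for cid, amount in orders]
tagged += [(cid, ("customer", country)) for cid, country in customers]
by_customer = defaultdict(list)
for cid, record in tagged:
    by_customer[cid].append(record)

joined = []  # (country, amount)
for records in by_customer.values():
    country = next(value for tag, value in records if tag == "customer")
    joined += [(country, value) for tag, value in records if tag == "order"]

# Round 2 -- aggregation: the shuffle groups joined rows by country,
# and the reduce sums each group's amounts.
revenue = defaultdict(float)
for country, amount in joined:
    revenue[country] += amount

print(dict(revenue))  # {'US': 160.0, 'DE': 80.0}

Hive's query planner generates this kind of multi-job plan automatically from a single SQL-like statement.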
The Hadoop Distributed File System (HDFS) provides an abstract interface for
distributing data files across a cluster of machines. Users don't have to know exactly
which data is available on a particular node in the Hadoop cluster. Moving data to a
location at which it will be processed is an expensive task, requiring data to be sent
over a network and potentially creating a performance bottleneck. Instead of moving data from storage nodes to processing nodes, HDFS lets each node process the data it already stores locally. This design choice makes Hadoop efficient for large data processing jobs. HDFS also provides fault tolerance through replication: if a node in
the Hadoop cluster goes down, the data will still be available somewhere else in the
cluster.
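The combination of replication and "move the computation to the data" can be pictured as a simple placement policy. The toy Python model below (node and block names are invented, and real Hadoop scheduling is far more involved) prefers to run a task on a live node that already holds a replica of the block it needs, falling back to shipping the data only when no such node survives.

# Toy model of locality-aware task placement; all names are hypothetical.
block_replicas = {
    "block-1": ["node-a", "node-b", "node-c"],  # each block replicated 3 times
    "block-2": ["node-b", "node-c", "node-d"],
    "block-3": ["node-a", "node-d", "node-e"],
}
live_nodes = {"node-a", "node-c", "node-d", "node-e"}  # node-b has failed

def place_task(block):
    # Prefer a live node that already stores a replica (no network transfer).
    local = [node for node in block_replicas[block] if node in live_nodes]
    if local:
        return local[0]
    # Last resort: run anywhere and pull the block over the network.
    return sorted(live_nodes)[0]

for block in block_replicas:
    print(block, "->", place_task(block))
# block-1 -> node-a, block-2 -> node-c, block-3 -> node-a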
HDFS is designed to facilitate batch processing of huge amounts of data, but it's not
meant to be a database by any means. This means that for Hive to keep track of data sources, it must provide a database-like structure on top of the files contained in HDFS. To record that structure, Hive uses a database of its own, known as the metastore.
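Conceptually, the metastore maps each table name to the HDFS location of its files and the schema needed to interpret them (Hive keeps this catalog in a relational database such as the embedded Derby or MySQL). The snippet below is a deliberately simplified, in-memory Python stand-in for that mapping; the table name, path, and columns are invented.

# A toy stand-in for the metastore: logical tables mapped to HDFS files.
metastore = {
    "page_views": {
        "location": "hdfs:///warehouse/page_views/",   # where the files live
        "columns": [("view_time", "string"), ("user_id", "bigint"), ("url", "string")],
        "row_format": "delimited text, tab-separated",
    },
}

def describe(table):
    # Roughly what Hive must look up before it can plan a query on the table.
    entry = metastore[table]
    print(table, "stored at", entry["location"])
    for name, col_type in entry["columns"]:
        print(" ", name, col_type)

describe("page_views")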
 