could only be handled by complex ETL and data warehouse tools. Hive can be a very
flexible choice for aggregating organizational data, providing relatively low cost at the
expense of raw speed. This makes Hive the right choice for integration with existing
Hadoop installations. When data grows to extreme levels, such as the petabyte scales
experienced by Facebook, then distributed solutions such as Hive might be the only
viable way to provide economically feasible data warehouse functionality.
Hive in Practice
Although Hive's query language is meant to be familiar to users of relational databases,
there are many differences that stem from Hive's underlying Hadoop infrastructure. In
order to understand how Hive works, let's take a quick look at some of the basic concepts behind Hadoop and MapReduce.
The Hadoop ecosystem contains a soup of terminology that can be confusing to
beginners. Hadoop itself is a framework for distributing data processing jobs across
many machines. Hadoop's MapReduce model for processing data is a three-step process. In the first step, called the map phase, data is split into many shards, each of which is identified by a particular key. The next phase, the shuffle sort, aggregates data shards containing the same key on the same node in the cluster, allowing for data processing to take place close to the actual data. Finally, in the reduce phase, the shuffled data from individual nodes is crunched on local machines and output to produce a final result.
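As a rough sketch of these three phases, the following plain-Python example (not actual Hadoop code; the sample documents are invented) counts words the way a MapReduce job would: the map phase emits (word, 1) pairs, the shuffle sort groups pairs by key, and the reduce phase sums each group.

from collections import defaultdict

def map_phase(documents):
    # Map: split each document into (word, 1) key/value pairs.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_sort(pairs):
    # Shuffle sort: gather all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each group of values into a final count.
    return {word: sum(values) for word, values in groups.items()}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle_sort(map_phase(documents))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

In a real Hadoop job, the map and reduce functions run on different machines and the framework performs the shuffle sort over the network, but the division of labor is the same.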
The MapReduce framework is a simple concept overall, but applying this processing model to complex tasks can be tricky. A task such as counting individual words across terabytes of files might require just a single MapReduce job. However, a complex aggregate query over two tables, involving mathematical operations and a join between two different types of data, may require multiple MapReduce jobs chained together.
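For example, a join followed by an aggregation typically compiles into more than one job. The sketch below (again plain Python with hypothetical tables and columns, not Hive or Hadoop code) chains two rounds: the first performs a reduce-side join of orders and customers on customer_id, and the second groups the joined records by country and sums the amounts.

from collections import defaultdict

# Hypothetical input tables; in Hive these would be files in HDFS.
orders = [("c1", 120.0), ("c2", 80.0), ("c1", 40.0)]   # (customer_id, amount)
customers = [("c1", "US"), ("c2", "DE")]               # (customer_id, country)

# Round 1 -- reduce-side join: the map tags each record with its source table,
# the shuffle groups records by customer_id, and the reduce pairs each order
# amount with that customer's country.
tagged = [(cid, ("order", amount)) for cid, amount in orders]
tagged += [(cid, ("customer", country)) for cid, country in customers]
by_customer = defaultdict(list)
for cid, record in tagged:
    by_customer[cid].append(record)

joined = []  # (country, amount)
for records in by_customer.values():
    country = next(value for tag, value in records if tag == "customer")
    joined += [(country, value) for tag, value in records if tag == "order"]

# Round 2 -- aggregation: the shuffle groups joined rows by country,
# and the reduce sums each group's amounts.
revenue = defaultdict(float)
for country, amount in joined:
    revenue[country] += amount

print(dict(revenue))  # {'US': 160.0, 'DE': 80.0}

Hive's query planner generates this kind of multi-job plan automatically from a single SQL-like statement.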
The Hadoop Distributed File System (HDFS) provides an abstract interface for
distributing data files across a cluster of machines. Users don't have to know exactly
which data is available on a particular node in the Hadoop cluster. Moving data to a
location at which it will be processed is an expensive task, requiring data to be sent
over a network and potentially creating a performance bottleneck. Instead of moving data from storage nodes to processing nodes, HDFS lets each node process the data it already stores locally. This design choice makes Hadoop efficient for large data processing jobs. HDFS also provides fault tolerance through replication: if a node in
the Hadoop cluster goes down, the data will still be available somewhere else in the
cluster.
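The combination of replication and "move the computation to the data" can be pictured as a simple placement policy. The toy Python model below (node and block names are invented, and real Hadoop scheduling is far more involved) prefers to run a task on a live node that already holds a replica of the block it needs, falling back to shipping the data only when no such node survives.

# Toy model of locality-aware task placement; all names are hypothetical.
block_replicas = {
    "block-1": ["node-a", "node-b", "node-c"],  # each block replicated 3 times
    "block-2": ["node-b", "node-c", "node-d"],
    "block-3": ["node-a", "node-d", "node-e"],
}
live_nodes = {"node-a", "node-c", "node-d", "node-e"}  # node-b has failed

def place_task(block):
    # Prefer a live node that already stores a replica (no network transfer).
    local = [node for node in block_replicas[block] if node in live_nodes]
    if local:
        return local[0]
    # Last resort: run anywhere and pull the block over the network.
    return sorted(live_nodes)[0]

for block in block_replicas:
    print(block, "->", place_task(block))
# block-1 -> node-a, block-2 -> node-c, block-3 -> node-a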
HDFS is designed to facilitate batch processing of huge amounts of data, but it's not
meant to be a database by any means. This means that for Hive to keep track of data sources, it must provide a database-like structure on top of the files contained in HDFS. To record that structure, Hive uses a database of its own, known as the metastore.
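Conceptually, the metastore maps each table name to the HDFS location of its files and the schema needed to interpret them (Hive keeps this catalog in a relational database such as the embedded Derby or MySQL). The snippet below is a deliberately simplified, in-memory Python stand-in for that mapping; the table name, path, and columns are invented.

# A toy stand-in for the metastore: logical tables mapped to HDFS files.
metastore = {
    "page_views": {
        "location": "hdfs:///warehouse/page_views/",   # where the files live
        "columns": [("view_time", "string"), ("user_id", "bigint"), ("url", "string")],
        "row_format": "delimited text, tab-separated",
    },
}

def describe(table):
    # Roughly what Hive must look up before it can plan a query on the table.
    entry = metastore[table]
    print(table, "stored at", entry["location"])
    for name, col_type in entry["columns"]:
        print(" ", name, col_type)

describe("page_views")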
 