10.2 The Hadoop Ecosystem
So far, this chapter has provided an overview of Apache Hadoop relative to its
implementation of HDFS and the MapReduce paradigm. Hadoop's popularity has
spawned proprietary and open source tools to make Apache Hadoop easier to use
and provide additional functionality and features. This portion of the chapter
examines the following Hadoop-related Apache projects:
Pig: Provides a high-level data-flow programming language
Hive: Provides SQL-like access
Mahout: Provides analytical tools
HBase: Provides real-time reads and writes
By masking the details necessary to develop a MapReduce program, Pig and Hive
each enable a developer to write high-level code that is later translated into one or
more MapReduce programs. Because MapReduce is intended for batch processing,
Pig and Hive are likewise best suited to batch-oriented use cases.
Once Hadoop processes a dataset, Mahout provides several tools that can analyze
the data in a Hadoop environment. For example, a k-means clustering analysis, as
described in Chapter 4, can be conducted using Mahout.
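As a sketch of how such an analysis might be launched, the following command line runs Mahout's k-means job over vectorized data already stored in HDFS. The input, centroid, and output paths are hypothetical, and the exact flags can vary across Mahout versions:

```
# Run k-means with Mahout over vectorized input in HDFS (paths are hypothetical).
# -i  input vectors        -c  initial cluster centroids
# -o  output directory     -k  number of clusters
# -x  maximum iterations   -cl assign input points to the final clusters
mahout kmeans \
  -i /user/analyst/vectors \
  -c /user/analyst/initial-centroids \
  -o /user/analyst/kmeans-output \
  -k 5 -x 10 -cl
```

Mahout translates this request into a series of MapReduce jobs, so the clustering itself runs in parallel across the Hadoop cluster.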
In contrast to the batch orientation of Pig and Hive, HBase provides the ability
to perform real-time reads and writes of data stored in a Hadoop environment. This
real-time access is achieved in part by caching data in memory as well as persisting
it in HDFS, and by accessing HBase data directly rather than through MapReduce jobs.
Because the design and operation of HBase differ significantly from relational
databases and from the other Hadoop tools examined here, a detailed description of
HBase will be presented.
10.2.1 Pig
Apache Pig consists of a data flow language, Pig Latin, and an environment to
execute the Pig code. The main benefit of using Pig is to utilize the power of
MapReduce in a distributed system, while simplifying the tasks of developing and
executing a MapReduce job. In most cases, it is transparent to the user that a
MapReduce job is running in the background when Pig commands are executed.
This abstraction layer on top of Hadoop simplifies the development of code against
data in HDFS and makes MapReduce more accessible to a larger audience.
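To illustrate, the following Pig Latin script implements the canonical word count; when it is executed, Pig compiles these statements into one or more MapReduce jobs behind the scenes. The HDFS input and output paths are hypothetical:

```
-- Load each line of a text file stored in HDFS (path is hypothetical).
lines   = LOAD '/user/analyst/input.txt' AS (line:chararray);
-- Split each line into words and flatten to one word per record.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together.
grouped = GROUP words BY word;
-- Count the occurrences of each word.
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
-- Write the (word, count) pairs back to HDFS.
STORE counts INTO '/user/analyst/wordcount_out';
```

Note that the script reads as a sequence of data-flow transformations; at no point does the author write map or reduce functions, which is precisely the abstraction Pig provides.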