10.2 The Hadoop Ecosystem
So far, this chapter has provided an overview of Apache Hadoop relative to its
implementation of HDFS and the MapReduce paradigm. Hadoop's popularity has
spawned proprietary and open source tools to make Apache Hadoop easier to use
and provide additional functionality and features. This portion of the chapter
examines the following Hadoop-related Apache projects:
Pig: Provides a high-level data-flow programming language
Hive: Provides SQL-like access
Mahout: Provides analytical tools
HBase: Provides real-time reads and writes
By masking the details necessary to develop a MapReduce program, Pig and Hive
each enable a developer to write high-level code that is later translated into one or
more MapReduce programs. Because MapReduce is intended for batch processing,
Pig and Hive are likewise best suited to batch-oriented use cases.
Once Hadoop processes a dataset, Mahout provides several tools that can analyze
the data in a Hadoop environment. For example, a k-means clustering analysis, as
described in Chapter 4, can be conducted using Mahout.
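As a sketch of how such an analysis might be launched, the following command line runs Mahout's k-means job over vectorized data already stored in HDFS. The input, centroid, and output paths are hypothetical, and the exact flags can vary across Mahout versions:

```
# Run k-means with Mahout over vectorized input in HDFS (paths are hypothetical).
# -i  input vectors        -c  initial cluster centroids
# -o  output directory     -k  number of clusters
# -x  maximum iterations   -cl assign input points to the final clusters
mahout kmeans \
  -i /user/analyst/vectors \
  -c /user/analyst/initial-centroids \
  -o /user/analyst/kmeans-output \
  -k 5 -x 10 -cl
```

Mahout translates this request into a series of MapReduce jobs, so the clustering itself runs in parallel across the Hadoop cluster.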
In contrast to the batch orientation of Pig and Hive, HBase provides the ability
to perform real-time reads and writes of data stored in a Hadoop environment. This
real-time access is achieved in part by caching data in memory as well as persisting
it in HDFS, and by accessing HBase data directly rather than through MapReduce jobs.
Because the design and operation of HBase differ significantly from relational
databases and from the other Hadoop tools examined here, a detailed description of
HBase will be presented.
10.2.1 Pig
Apache Pig consists of a data flow language, Pig Latin, and an environment to
execute the Pig code. The main benefit of using Pig is to utilize the power of
MapReduce in a distributed system, while simplifying the tasks of developing and
executing a MapReduce job. In most cases, it is transparent to the user that a
MapReduce job is running in the background when Pig commands are executed.
This abstraction layer on top of Hadoop simplifies the development of code against
data in HDFS and makes MapReduce more accessible to a larger audience.
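To illustrate, the following Pig Latin script implements the canonical word count; when it is executed, Pig compiles these statements into one or more MapReduce jobs behind the scenes. The HDFS input and output paths are hypothetical:

```
-- Load each line of a text file stored in HDFS (path is hypothetical).
lines   = LOAD '/user/analyst/input.txt' AS (line:chararray);
-- Split each line into words and flatten to one word per record.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together.
grouped = GROUP words BY word;
-- Count the occurrences of each word.
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
-- Write the (word, count) pairs back to HDFS.
STORE counts INTO '/user/analyst/wordcount_out';
```

Note that the script reads as a sequence of data-flow transformations; at no point does the author write map or reduce functions, which is precisely the abstraction Pig provides.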