For users who prefer to use a programming language other than Java, there are
some other options. One option is to use the Hadoop Streaming API, which
allows the user to write and run Hadoop jobs with no direct knowledge of Java
[18]. However, knowledge of some other programming language, such as Python,
C, or Ruby, is necessary. Apache Hadoop provides the hadoop-streaming.jar
file, which accepts the HDFS paths for the input/output files and the paths for
the files that implement the map and reduce functionality.
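For illustration, here is a minimal streaming mapper written in Python for a
hypothetical word-count job; it reads records from stdin and writes
tab-separated key/value pairs to stdout:

    #!/usr/bin/env python
    # mapper.py: emit a (word, 1) pair, formatted as key<TAB>value,
    # for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

Such a job might then be submitted with a command along these lines (the jar
location and HDFS paths are installation-specific and shown here only as
placeholders):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/demo/input -output /user/demo/output \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py

The -file options copy the two scripts to each worker node.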
Here are some important considerations when preparing and running a Hadoop
streaming job:
• Although the shuffle and sort phase delivers its output to the reducer in
key-sorted order, the reducer does not receive the corresponding values for
a key as a list; rather, it receives individual key/value pairs. The reduce
code therefore has to monitor for changes in the key and appropriately
handle each new key (see the reducer sketch following this list).
• The map and reduce code must already be in an executable form, or the
necessary interpreter must already be installed on each worker node.
• The map and reduce code must already reside on each worker node, or the
location of the code must be provided when the job is submitted. In the
latter case, the code is copied to each worker node.
• Some functionality, such as a partitioner, still needs to be written in Java.
• The inputs and outputs are handled through stdin and stdout. Stderr is
also available to track the status of the tasks, implement counter
functionality, and report execution issues [18]; the reducer sketch
following this list includes an example.
• The streaming API may not perform as well as similar functionality
written in Java.
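To make the first and fifth considerations concrete, the following Python
reducer is a minimal sketch that completes the hypothetical word-count job
above. It detects key changes itself, because the sorted key/value pairs
arrive one at a time rather than grouped into lists, and it increments a job
counter by writing a line in the reporter:counter:group,counter,amount format
to stderr:

    #!/usr/bin/env python
    # reducer.py: sum the counts for each word. Input arrives on stdin
    # as key<TAB>value lines sorted by key, but values are not grouped.
    import sys

    current_key = None
    total = 0

    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                # The key changed, so the previous key is complete; emit it.
                print("%s\t%d" % (current_key, total))
                # Increment a streaming counter by writing to stderr.
                sys.stderr.write("reporter:counter:WordCount,KeysEmitted,1\n")
            current_key = key
            total = 0
        total += int(value)

    if current_key is not None:
        print("%s\t%d" % (current_key, total))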
A second alternative is to use Hadoop Pipes, a mechanism that uses compiled
C++ code for the map and reduce functionality. An advantage of using C++ is
the extensive set of numerical libraries available to include in the code [19].
To work directly with data in HDFS, one option is to use the C API (libhdfs) or
the Java API provided with Apache Hadoop. These APIs allow reads and writes to
HDFS data files outside the typical MapReduce paradigm [20]. Such an approach
may be useful when attempting to debug a MapReduce job by examining the input
data or when the objective is to transform the HDFS data prior to running a
MapReduce job.
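Although the text above refers to the C and Java APIs, the same kind of direct
access is possible from Python. The following sketch assumes the third-party
pyarrow library (whose HDFS client is itself built on libhdfs) is installed;
the NameNode host, port, and file path are hypothetical:

    #!/usr/bin/env python
    # read_hdfs.py: peek at the first few lines of an HDFS file, for
    # example to sanity-check input data before launching a MapReduce
    # job. Assumes pyarrow (a libhdfs-backed client) and a reachable
    # NameNode.
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
    with hdfs.open_input_stream("/user/demo/input/part-00000") as stream:
        data = stream.read(4096)  # read only the first 4 KB
        for line in data.decode("utf-8", errors="replace").splitlines()[:10]:
            print(line)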