For users who prefer to use a programming language other than Java, there are
some other options. One option is to use the Hadoop Streaming API, which
allows the user to write and run Hadoop jobs with no direct knowledge of Java
[18]. However, knowledge of some other programming language, such as Python,
C, or Ruby, is necessary. Apache Hadoop provides the hadoop-streaming.jar
file, which accepts the HDFS paths for the input/output files and the paths for
the files that implement the map and reduce functionality.
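For illustration, here is a minimal streaming mapper written in Python for a
hypothetical word-count job; it reads records from stdin and writes
tab-separated key/value pairs to stdout:

    #!/usr/bin/env python
    # mapper.py: emit a (word, 1) pair, formatted as key<TAB>value,
    # for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

Such a job might then be submitted with a command along these lines (the jar
location and HDFS paths are installation-specific and shown here only as
placeholders):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/demo/input -output /user/demo/output \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py

The -file options copy the two scripts to each worker node.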
Here are some important considerations when preparing and running a Hadoop
streaming job:
• Although the shuffle and sort phase delivers its output to the reducer in
key-sorted order, the reducer does not receive the corresponding values for
a key as a list; rather, it receives individual key/value pairs. The reduce
code therefore has to monitor for changes in the key and appropriately
handle each new key (see the reducer sketch following this list).
• The map and reduce code must already be in an executable form, or the
necessary interpreter must already be installed on each worker node.
• The map and reduce code must already reside on each worker node, or the
location of the code must be provided when the job is submitted. In the
latter case, the code is copied to each worker node.
• Some functionality, such as a partitioner, still needs to be written in Java.
• The inputs and outputs are handled through stdin and stdout. Stderr is
also available to track the status of the tasks, implement counter
functionality, and report execution issues [18]; the reducer sketch
following this list includes an example.
• The streaming API may not perform as well as similar functionality
written in Java.
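To make the first and fifth considerations concrete, the following Python
reducer is a minimal sketch that completes the hypothetical word-count job
above. It detects key changes itself, because the sorted key/value pairs
arrive one at a time rather than grouped into lists, and it increments a job
counter by writing a line in the reporter:counter:group,counter,amount format
to stderr:

    #!/usr/bin/env python
    # reducer.py: sum the counts for each word. Input arrives on stdin
    # as key<TAB>value lines sorted by key, but values are not grouped.
    import sys

    current_key = None
    total = 0

    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                # The key changed, so the previous key is complete; emit it.
                print("%s\t%d" % (current_key, total))
                # Increment a streaming counter by writing to stderr.
                sys.stderr.write("reporter:counter:WordCount,KeysEmitted,1\n")
            current_key = key
            total = 0
        total += int(value)

    if current_key is not None:
        print("%s\t%d" % (current_key, total))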
A second alternative is to use Hadoop Pipes, a mechanism that uses compiled
C++ code for the map and reduce functionality. An advantage of using C++ is
the extensive set of numerical libraries available to include in the code [19].
To work directly with data in HDFS, one option is to use the C API (libhdfs) or
the Java API provided with Apache Hadoop. These APIs allow reads and writes to
HDFS data files outside the typical MapReduce paradigm [20]. Such an approach
may be useful when attempting to debug a MapReduce job by examining the input
data or when the objective is to transform the HDFS data prior to running a
MapReduce job.
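Although the text above refers to the C and Java APIs, the same kind of direct
access is possible from Python. The following sketch assumes the third-party
pyarrow library (whose HDFS client is itself built on libhdfs) is installed;
the NameNode host, port, and file path are hypothetical:

    #!/usr/bin/env python
    # read_hdfs.py: peek at the first few lines of an HDFS file, for
    # example to sanity-check input data before launching a MapReduce
    # job. Assumes pyarrow (a libhdfs-backed client) and a reachable
    # NameNode.
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
    with hdfs.open_input_stream("/user/demo/input/part-00000") as stream:
        data = stream.read(4096)  # read only the first 4 KB
        for line in data.decode("utf-8", errors="replace").splitlines()[:10]:
            print(line)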