# Kick off the streaming MapReduce job using local mapper
# and reducer scripts. Note: file paths may differ in your
# installation.
> hadoop jar $HADOOP_PATH/contrib/streaming/hadoop-*streaming*.jar \
-file $HOME/mapper.py -mapper $HOME/mapper.py \
-file $HOME/reducer.py -reducer $HOME/reducer.py \
-input /user/hduser/VS2010NATL.DETAILUS.PUB \
-output /user/hduser/output
Managing Complexity: Python MapReduce Frameworks for Hadoop
The Hadoop streaming API can be a great way to write custom MapReduce jobs,
especially for single-step tasks. Once you start writing many Hadoop streaming
scripts, however, you quickly realize that there are many scenarios in which a few
simple map and reduce steps will not be enough. Complex data transformations
require additional steps, with the output of one step pipelined into the input
of the next.
The monthly-birth-count example shown previously is very simple: it demonstrates
a single processing step. What happens if you want to run two mappers at once and
then use a single reducer step to join their output? Building this type of processing
pipeline is certainly possible, but the code can quickly become unwieldy.
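To see why, consider a bare-bones driver that chains two streaming jobs together by
hand. The following sketch is purely illustrative: the jar location, script names, and
HDFS paths are placeholders, not part of the earlier example.

import subprocess

# Placeholder path; the streaming jar location varies by installation.
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

def run_streaming_step(mapper, reducer, input_path, output_path):
    # Launch one Hadoop streaming job and block until it completes.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ])

# Each extra stage must be wired up manually: the first job's output
# directory becomes the second job's input, and failure handling and
# intermediate-path bookkeeping are left entirely to us.
run_streaming_step("mapper1.py", "reducer1.py",
                   "/user/hduser/input", "/user/hduser/stage1")
run_streaming_step("mapper2.py", "reducer2.py",
                   "/user/hduser/stage1", "/user/hduser/output")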
Fortunately, there are many open-source frameworks built on top of Hadoop's
streaming utility that help to address these challenges. These frameworks not only
provide useful features for managing the complexity of MapReduce jobs, but can
also reduce the amount of code needed to craft our processing pipelines.
One of these frameworks is an open-source Python module known as mrjob,
which was created by Yelp to help speed up the process of building and running
MapReduce tasks on Hadoop clusters. Using mrjob is a natural progression from
writing Hadoop streaming API jobs directly in Bash or Python. Just like a vanilla
Hadoop streaming script, mrjob can be used both locally for testing and with an
existing Hadoop cluster. Because Yelp does quite a lot of work in Amazon's EC2
environment, mrjob is especially well suited to running jobs with Amazon's Elastic
MapReduce service.
Rewriting Our Hadoop Streaming Example Using mrjob
mrjob can be built and installed like any other Python module. To use it, extend
the MRJob class with a custom class that defines a series of processing steps. For
one-step MapReduce jobs, simply define a mapper and a reducer function within your
class and call the MRJob.run method, as in Listing 8.9. Like our previous Hadoop
streaming scripts, our mrjob script will read data from stdin and write its output
to stdout.
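Listing 8.9 itself is not reproduced here, but a minimal one-step job in this style
might look like the sketch below. The class name and the fixed-width slice used to
pull out the birth month are illustrative assumptions; the real offsets depend on the
layout of the natality records.

from mrjob.job import MRJob

class MRMonthlyBirthCount(MRJob):

    def mapper(self, _, line):
        # Assumption: the two-digit birth month sits at a fixed
        # offset in each record; adjust the slice to match the
        # actual file layout.
        month = line[18:20]
        yield month, 1

    def reducer(self, month, counts):
        # Sum the per-record counts emitted by the mapper.
        yield month, sum(counts)

if __name__ == '__main__':
    MRMonthlyBirthCount.run()

During development, the script can be run locally (for example,
python mr_monthly_birth_count.py < input.txt); passing the -r hadoop runner
option submits the same job to a Hadoop cluster instead.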