# Kick off the streaming MapReduce job using local mapper
# and reducer scripts. Note: file paths may differ in your
# installation.
> hadoop jar $HADOOP_PATH/contrib/streaming/hadoop-*streaming*.jar \
-file $HOME/mapper.py -mapper $HOME/mapper.py \
-file $HOME/reducer.py -reducer $HOME/reducer.py \
-input /user/hduser/VS2010NATL.DETAILUS.PUB \
-output /user/hduser/output
Managing Complexity: Python MapReduce Frameworks for Hadoop
The Hadoop streaming API can be a great way to write custom MapReduce jobs,
especially for single-step tasks. Once you start writing many Hadoop streaming
scripts, however, you quickly realize that there are many scenarios in which a few
simple map and reduce steps will not be enough. Complex data transformations
require additional steps, with the output of one step pipelined into the input
of the next.
The monthly-birth-count example shown previously is very simple: it demonstrates
a single processing step. What happens if you want to run two mappers at once and
then use a single reducer step to join their output? Building this type of processing
pipeline is certainly possible, but the code can quickly become unwieldy.
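To see why, consider a bare-bones driver that chains two streaming jobs together by
hand. The following sketch is purely illustrative: the jar location, script names, and
HDFS paths are placeholders, not part of the earlier example.

import subprocess

# Placeholder path; the streaming jar location varies by installation.
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

def run_streaming_step(mapper, reducer, input_path, output_path):
    # Launch one Hadoop streaming job and block until it completes.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ])

# Each extra stage must be wired up manually: the first job's output
# directory becomes the second job's input, and failure handling and
# intermediate-path bookkeeping are left entirely to us.
run_streaming_step("mapper1.py", "reducer1.py",
                   "/user/hduser/input", "/user/hduser/stage1")
run_streaming_step("mapper2.py", "reducer2.py",
                   "/user/hduser/stage1", "/user/hduser/output")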
Fortunately, there are many open-source frameworks built on top of Hadoop's
streaming utility that help to address these challenges. These frameworks not only
provide useful features for managing the complexity of MapReduce jobs, but can
also reduce the amount of code needed to craft our processing pipelines.
One of these frameworks is an open-source Python module known as mrjob,
which was created by Yelp to help speed up the process of building and running
MapReduce tasks on Hadoop clusters. Using mrjob is a natural progression from
writing Hadoop streaming API jobs directly in Bash or Python. Just like a vanilla
Hadoop streaming script, mrjob can be used both locally for testing and with an
existing Hadoop cluster. Because Yelp does quite a lot of work in Amazon's EC2
environment, mrjob is especially well suited to running jobs with Amazon's Elastic
MapReduce service.
Rewriting Our Hadoop Streaming Example Using mrjob
mrjob can be built and installed like any other Python module. To use it, extend
the MRJob class with a custom class that defines a series of processing steps. For
one-step MapReduce jobs, simply define a mapper and a reducer function within your
class and call the MRJob.run method, as in Listing 8.9. Like our previous Hadoop
streaming scripts, our mrjob script will read data from stdin and write its output
to stdout.
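Listing 8.9 itself is not reproduced here, but a minimal one-step job in this style
might look like the sketch below. The class name and the fixed-width slice used to
pull out the birth month are illustrative assumptions; the real offsets depend on the
layout of the natality records.

from mrjob.job import MRJob

class MRMonthlyBirthCount(MRJob):

    def mapper(self, _, line):
        # Assumption: the two-digit birth month sits at a fixed
        # offset in each record; adjust the slice to match the
        # actual file layout.
        month = line[18:20]
        yield month, 1

    def reducer(self, month, counts):
        # Sum the per-record counts emitted by the mapper.
        yield month, sum(counts)

if __name__ == '__main__':
    MRMonthlyBirthCount.run()

During development, the script can be run locally (for example,
python mr_monthly_birth_count.py < input.txt); passing the -r hadoop runner
option submits the same job to a Hadoop cluster instead.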