configuration file stored in the user's home directory. Once this is set up, kick off your
MapReduce job by running your mrjob script with the -r flag set to emr. It is also
possible to specify the number and type of EC2 instances to use for your MapReduce
job. See Listing 8.13 for an example.
Listing 8.13 Using mrjob with Elastic MapReduce
# Set the Access Key ID and Secret Access Key Environment Variables
> export AWS_ACCESS_KEY_ID=XXXACCESSKEYHEREXXX
> export AWS_SECRET_ACCESS_KEY=XXXSECRETKEYHEREXXX
# Start an Elastic MapReduce job with 4 small instances
> python your_mr_job_sub_class.py -r emr \
  --ec2-instance-type m1.small --num-ec2-instances 4
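Rather than exporting credentials in every shell session, the same settings can live in the mrjob configuration file mentioned above (conventionally ~/.mrjob.conf). The sketch below assumes the YAML layout used by mrjob releases current at the time of writing; the key values are placeholders.

# ~/.mrjob.conf: a minimal sketch of an EMR runner configuration
runners:
  emr:
    aws_access_key_id: XXXACCESSKEYHEREXXX
    aws_secret_access_key: XXXSECRETKEYHEREXXX
    ec2_instance_type: m1.small
    num_ec2_instances: 4

With these options in place, the script can be launched with nothing more than the -r emr flag.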
Alternative Python-Based MapReduce Frameworks
mrjob is not the only Python-based MapReduce framework for Hadoop streaming.
Another popular framework is Dumbo . Dumbo is similar to mrjob in many ways,
and for simple MapReduce tasks, the structure of a Dumbo script is fairly similar
to one using mrjob. In my opinion, one of the strongest differentiators is mrjob's
EMR integration. Because Yelp runs a lot of data processing tasks in the Amazon
cloud, kicking off a job on Elastic MapReduce is a bit easier using mrjob than it is
with Dumbo. To its credit, Dumbo is probably a bit easier to use if you need to
work with a custom data-input format.
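For a sense of how close the two frameworks are, here is a minimal word-count sketch in Dumbo's function-based style. The mapper and reducer signatures follow Dumbo's documented conventions; treat the script as illustrative rather than production-ready.

def mapper(key, value):
    # Emit each word in the input line with a count of one
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the per-word counts produced by the mappers
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

Unlike an mrjob script, a Dumbo script is typically submitted to the cluster with the dumbo command-line tool rather than invoked directly with python.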
Both mrjob and Dumbo are wrappers for the Hadoop streaming API, but they
don't attempt to provide direct access to all the classes and methods found in the
lower-level Hadoop Java API. For even more fine-grained access to these functions,
it might be worth taking a look at Pydoop. Pydoop's goal is to be a Python interface
to Hadoop and HDFS itself. Because Pydoop sidesteps Hadoop's streaming API, it
may be more performant than frameworks such as mrjob or Dumbo. If you are
analyzing terabyte datasets, a 2X boost in performance might mean huge savings in
overall processing time.
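To give a taste of that lower-level access, the sketch below uses Pydoop's HDFS module to list a directory and read a file straight from Python. The paths are placeholders, and the exact behavior of these calls can vary between Pydoop releases, so treat this as a rough illustration rather than a reference.

import pydoop.hdfs as hdfs

# List the contents of an HDFS directory (path is a placeholder)
for path in hdfs.ls("/user/hadoop/output"):
    print(path)

# Open an HDFS file and read it like a local file object
with hdfs.open("/user/hadoop/output/part-00000") as reader:
    for line in reader:
        print(line.rstrip())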
Summary
Dealing with the collection, processing, and analysis of large amounts of data requires
specialized tools for each step. To solve this problem, it is necessary to build data
processing pipelines that transform data from one state to another. Hadoop makes it
practical to distribute pipeline tasks across a large number of machines, but writing
custom MapReduce jobs using the standard Hadoop API can be challenging.
Many common, large-scale data processing tasks can be addressed using the Hadoop
streaming API. Hadoop streaming scripts are a great solution for single-step jobs