configuration file stored in the user's home directory. Once this is set up, kick off your
MapReduce job by running your mrjob script with the -r flag set to emr. It is also
possible to specify the number and type of EC2 instances to use for your MapReduce
job. See Listing 8.13 for an example.
Listing 8.13 Using mrjob with Elastic MapReduce
# Set the Access Key ID and Secret Access Key Environment Variables
> export AWS_ACCESS_KEY_ID=XXXACCESSKEYHEREXXX
> export AWS_SECRET_ACCESS_KEY=XXXSECRETKEYHEREXXX
# Start an Elastic MapReduce job with 4 small instances
> python your_mr_job_sub_class.py -r emr \
  --ec2-instance-type m1.small --num-ec2-instances 4
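Rather than exporting credentials in every shell session, the same settings can live in the mrjob configuration file mentioned above (conventionally ~/.mrjob.conf). The sketch below assumes the YAML layout used by mrjob releases current at the time of writing; the key values are placeholders.

# ~/.mrjob.conf: a minimal sketch of an EMR runner configuration
runners:
  emr:
    aws_access_key_id: XXXACCESSKEYHEREXXX
    aws_secret_access_key: XXXSECRETKEYHEREXXX
    ec2_instance_type: m1.small
    num_ec2_instances: 4

With these options in place, the script can be launched with nothing more than the -r emr flag.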
Alternative Python-Based MapReduce Frameworks
mrjob is not the only Python-based MapReduce framework for Hadoop streaming.
Another popular framework is Dumbo . Dumbo is similar to mrjob in many ways,
and for simple MapReduce tasks, the structure of a Dumbo script is fairly similar
to one using mrjob. In my opinion, one of the strongest differentiators is mrjob's
EMR integration. Because Yelp runs a lot of data processing tasks in the Amazon
cloud, kicking off a job on Elastic MapReduce is a bit easier using mrjob than it is
with Dumbo. To its credit, Dumbo is probably a bit easier to use if you need to
work with a custom data-input format.
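For a sense of how close the two frameworks are, here is a minimal word-count sketch in Dumbo's function-based style. The mapper and reducer signatures follow Dumbo's documented conventions; treat the script as illustrative rather than production-ready.

def mapper(key, value):
    # Emit each word in the input line with a count of one
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the per-word counts produced by the mappers
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

Unlike an mrjob script, a Dumbo script is typically submitted to the cluster with the dumbo command-line tool rather than invoked directly with python.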
Both mrjob and Dumbo are wrappers for the Hadoop streaming API, but they
don't attempt to provide direct access to all the classes and methods found in the
lower-level Hadoop Java API. For even more fine-grained access to these functions,
it might be worth taking a look at Pydoop. Pydoop's goal is to be a Python interface
to Hadoop and HDFS itself. Because Pydoop sidesteps Hadoop's streaming API, it
may be more performant than frameworks such as mrjob or Dumbo. If you are
analyzing terabyte datasets, a 2X boost in performance might mean huge savings in
overall processing time.
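To give a taste of that lower-level access, the sketch below uses Pydoop's HDFS module to list a directory and read a file straight from Python. The paths are placeholders, and the exact behavior of these calls can vary between Pydoop releases, so treat this as a rough illustration rather than a reference.

import pydoop.hdfs as hdfs

# List the contents of an HDFS directory (path is a placeholder)
for path in hdfs.ls("/user/hadoop/output"):
    print(path)

# Open an HDFS file and read it like a local file object
with hdfs.open("/user/hadoop/output/part-00000") as reader:
    for line in reader:
        print(line.rstrip())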
Summary
Dealing with the collection, processing, and analysis of large amounts of data requires
specialized tools for each step. To solve this problem, it is necessary to build data
processing pipelines that transform data from one state to another. Hadoop makes it
practical to distribute pipeline tasks across a large number of machines, but writing
custom MapReduce jobs using the standard Hadoop API can be challenging.
Many common, large-scale data processing tasks can be addressed using the Hadoop
streaming API. Hadoop streaming scripts are a great solution for single-step jobs