Listing 8.9 A simple one-step MapReduce counter using mrjob
from mrjob.job import MRJob

class MRBirthCounter(MRJob):
    # The mapper reads records from stdin; columns 15-20 of each
    # fixed-width record hold the year and month of birth (YYYYMM)
    def mapper(self, key, record):
        yield record[14:20], 1

    # The reducer yields the sum of the counts
    # of each month's births
    def reducer(self, month, births):
        yield month, sum(births)

if __name__ == '__main__':
    MRBirthCounter.run()
Under the hood, the mrjob_simple_example.py script uses the same Hadoop
streaming API that we used earlier, which means we can still test our code by
piping data through stdin.
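Because mapper() and reducer() are ordinary generator methods, we can also sanity-check them directly in a Python session before piping any real data. The snippet below is a minimal sketch: the padded sample record is hypothetical, and it assumes (based on the slice in Listing 8.9 and the output in Listing 8.10) that columns 15-20 of each fixed-width record hold the year and month of birth as YYYYMM.

from mrjob_simple_example import MRBirthCounter

# Hypothetical fixed-width record: columns 15-20 (record[14:20])
# are assumed to hold the YYYYMM birth field
sample_record = " " * 14 + "201001" + " " * 20

job = MRBirthCounter(args=[])

# The mapper should emit a single ("201001", 1) pair for this record
print(list(job.mapper(None, sample_record)))

# The reducer sums the counts it receives for a given month
print(list(job.reducer("201001", [1, 1, 1])))  # [("201001", 3)]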
After we've tested our mrjob script on a local machine, let's run it on a Hadoop
cluster (Listing 8.10). This is as easy as running the script like a normal Python
application, as long as the HADOOP_HOME environment variable is set on the
machine where our script lives. We pass the -r flag to specify that the script
should run on our Hadoop cluster rather than locally.
Listing 8.10 Testing and running mrjob_simple_example.py
# Run the mrjob script on our small sample data
> python mrjob_simple_example.py < birth_data_sample.txt
"201001" 8701
"201002" 8155
"201003" 8976
# etc...
# Run the mrjob script on an existing Hadoop cluster
# Ensure HADOOP_HOME environment variable is set
# (This path may differ for your installation)
> export HADOOP_HOME=/usr/local/hadoop-0.20.2/
# Use the -r flag to tell mrjob to run on Hadoop
> python mrjob_simple_example.py \
-r hadoop hdfs:///user/hduser/data/VS2010NATL.DETAILUS.PUB
 
 