Listing 8.9 A simple one-step MapReduce counter using mrjob
from mrjob.job import MRJob

class MRBirthCounter(MRJob):
    # The mapper reads records from stdin; columns 15-20 of each
    # fixed-width record hold the year and month of birth (YYYYMM)
    def mapper(self, key, record):
        yield record[14:20], 1

    # The reducer yields the sum of the counts
    # of each month's births
    def reducer(self, month, births):
        yield month, sum(births)

if __name__ == '__main__':
    MRBirthCounter.run()
Under the hood, the mrjob_simple_example.py script uses the same Hadoop
streaming API that we used earlier, which means we can still test our code by
piping data through stdin.
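Because mapper() and reducer() are ordinary generator methods, we can also sanity-check them directly in a Python session before piping any real data. The snippet below is a minimal sketch: the padded sample record is hypothetical, and it assumes (based on the slice in Listing 8.9 and the output in Listing 8.10) that columns 15-20 of each fixed-width record hold the year and month of birth as YYYYMM.

from mrjob_simple_example import MRBirthCounter

# Hypothetical fixed-width record: columns 15-20 (record[14:20])
# are assumed to hold the YYYYMM birth field
sample_record = " " * 14 + "201001" + " " * 20

job = MRBirthCounter(args=[])

# The mapper should emit a single ("201001", 1) pair for this record
print(list(job.mapper(None, sample_record)))

# The reducer sums the counts it receives for a given month
print(list(job.reducer("201001", [1, 1, 1])))  # [("201001", 3)]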
After we've tested our mrjob script on a local machine, let's run it on a Hadoop
cluster (Listing 8.10). This is as easy as running the script like a normal Python
application, as long as the HADOOP_HOME environment variable is set on the
machine where our script lives. We pass the -r flag to specify that the script
should run on our Hadoop cluster rather than locally.
Listing 8.10 Testing and running mrjob_simple_example.py
# Run the mrjob script on our small sample data
> python mrjob_simple_example.py < birth_data_sample.txt
"201001" 8701
"201002" 8155
"201003" 8976
# etc...
# Run the mrjob script on an existing Hadoop cluster
# Ensure HADOOP_HOME environment variable is set
# (This path may differ for your installation)
> export HADOOP_HOME=/usr/local/hadoop-0.20.2/
# Use the -r flag to tell mrjob to run on Hadoop
> python mrjob_simple_example.py \
-r hadoop hdfs:///user/hduser/data/VS2010NATL.DETAILUS.PUB
 
 