if __name__ == "__main__":
    main()
Testing the MapReduce Pipeline Locally
Because the Hadoop streaming API enables us to use any script that accepts input
through stdin, we can easily test our MapReduce scripts on a single machine before
farming them out to our Hadoop cluster. In these situations it is useful to try your
MapReduce job on a small sample of your source data in order to sanity check the
workflow. In the example that follows, I've simply used the Unix head command to
slice off the first hundred thousand records into a new, smaller file called
birth_data_sample.txt.
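If you are following along without the mapper listing from earlier in the chapter, a
minimal stdin-to-stdout mapper along these lines will do. Note that the fixed-width
column offsets used to pull out the birth year and month are placeholders rather than
a verified record layout; check the natality dataset's documentation for the real
positions.

#!/usr/bin/env python
# mapper.py -- a sketch of a streaming mapper that reads raw birth
# records from stdin and emits one "<year>-<month>\t1" pair per record.
# NOTE: the column offsets below are assumptions, not the dataset's
# documented layout.
import sys

def main():
    for line in sys.stdin:
        year = line[14:18]   # assumed offset of the birth year field
        month = line[18:20]  # assumed offset of the birth month field
        print("%s-%s\t1" % (year, month))

if __name__ == "__main__":
    main()

Each input record produces exactly one key-value pair, which is all the reducer
needs in order to tally births per month.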
In order to use our mapper and reducer scripts with Hadoop streaming, we also
need to make sure that they are executable. Use the chmod command to mark both
scripts as executable (see Listing 8.7).
We are ready to test our pipeline. The output of the mapper script will be piped
into the reducer script as input. Between the mapper and the reducer steps, we sort
the intermediate records by key. This simulates the shuffle-and-sort phase that
Hadoop handles for us on a cluster.
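That sort step is not optional: a streaming reducer reads one line at a time and can
only detect where one key ends and the next begins if equal keys arrive adjacent. A
reducer sketch along these lines makes that dependence explicit; its key:count output
format matches Listing 8.7, but the aggregation logic itself is reconstructed here
rather than copied from the earlier listing.

#!/usr/bin/env python
# reducer.py -- a sketch of the reducer. It assumes its input arrives
# sorted by key, which both the local `sort` step and Hadoop's shuffle
# guarantee, so all lines sharing a key are adjacent.
import sys

def main():
    current_key = None
    count = 0
    for line in sys.stdin:
        key, value = line.strip().split("\t")
        if key == current_key:
            count += int(value)
        else:
            # A new key means the previous key's count is complete.
            if current_key is not None:
                print("%s:%d" % (current_key, count))
            current_key = key
            count = int(value)
    # Flush the final key after the input is exhausted.
    if current_key is not None:
        print("%s:%d" % (current_key, count))

if __name__ == "__main__":
    main()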
Listing 8.7 Running our mapper.py and reducer.py scripts locally on test data
# Create a small, 75 MB sample of the 2010 birth dataset
> head -n 100000 VS2010NATL.DETAILUS.PUB > birth_data_sample.txt
# Make sure that the mapper.py and reducer.py scripts are executable
> chmod +x mapper.py
> chmod +x reducer.py
# First, test the mapper alone
> cat birth_data_sample.txt | ./mapper.py
2010-09 1
2010-10 1
2010-08 1
2010-07 1
2010-10 1
# etc...
# Now pipe the sorted results of the mapper into the reducer step
> cat birth_data_sample.txt | ./mapper.py | sort | ./reducer.py
2010-01:8701
2010-02:8155
2010-03:8976
2010-04:8521
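Once the local run produces sensible counts, the same pair of scripts can be
submitted to the cluster through the Hadoop streaming jar. A typical invocation
looks like the following; the jar location and the HDFS input and output paths are
placeholders that depend on your installation.

# Submit the same scripts as a Hadoop streaming job
# (jar path and HDFS paths are examples only)
> hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /data/VS2010NATL.DETAILUS.PUB \
    -output /results/births_by_month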