if __name__ == "__main__":
    main()
Testing the MapReduce Pipeline Locally
Because the Hadoop streaming API enables us to use any script that accepts input
through stdin, we can easily test our MapReduce scripts on a single machine before
farming them out to our Hadoop cluster. In these situations it is useful to try your
MapReduce job on a small sample of your source data in order to sanity check the
workflow. In the example that follows, I've simply used the Unix head command to
slice off the first hundred thousand records into a new, smaller file called
birth_data_sample.txt.
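If you are following along without the mapper listing from earlier in the chapter, a
minimal stdin-to-stdout mapper along these lines will do. Note that the fixed-width
column offsets used to pull out the birth year and month are placeholders rather than
a verified record layout; check the natality dataset's documentation for the real
positions.

#!/usr/bin/env python
# mapper.py -- a sketch of a streaming mapper that reads raw birth
# records from stdin and emits one "<year>-<month>\t1" pair per record.
# NOTE: the column offsets below are assumptions, not the dataset's
# documented layout.
import sys

def main():
    for line in sys.stdin:
        year = line[14:18]   # assumed offset of the birth year field
        month = line[18:20]  # assumed offset of the birth month field
        print("%s-%s\t1" % (year, month))

if __name__ == "__main__":
    main()

Each input record produces exactly one key-value pair, which is all the reducer
needs in order to tally births per month.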
In order to use our mapper and reducer scripts with Hadoop streaming, we also
need to make sure that they are executable. Use the chmod command to mark both
scripts as executable (see Listing 8.7).
We are ready to test our pipeline. The output of the mapper script will be piped
into the reducer script as input. Between the mapper and the reducer steps, we sort
the intermediate records by key. This simulates the shuffle-and-sort phase that
Hadoop handles for us on a cluster.
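That sort step is not optional: a streaming reducer reads one line at a time and can
only detect where one key ends and the next begins if equal keys arrive adjacent. A
reducer sketch along these lines makes that dependence explicit; its key:count output
format matches Listing 8.7, but the aggregation logic itself is reconstructed here
rather than copied from the earlier listing.

#!/usr/bin/env python
# reducer.py -- a sketch of the reducer. It assumes its input arrives
# sorted by key, which both the local `sort` step and Hadoop's shuffle
# guarantee, so all lines sharing a key are adjacent.
import sys

def main():
    current_key = None
    count = 0
    for line in sys.stdin:
        key, value = line.strip().split("\t")
        if key == current_key:
            count += int(value)
        else:
            # A new key means the previous key's count is complete.
            if current_key is not None:
                print("%s:%d" % (current_key, count))
            current_key = key
            count = int(value)
    # Flush the final key after the input is exhausted.
    if current_key is not None:
        print("%s:%d" % (current_key, count))

if __name__ == "__main__":
    main()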
Listing 8.7 Running our mapper.py and reducer.py scripts locally on test data
# Create a small, 75 MB sample of the 2010 birth dataset
> head -n 100000 VS2010NATL.DETAILUS.PUB > birth_data_sample.txt
# Make sure that the mapper.py and reducer.py scripts are executable
> chmod +x mapper.py
> chmod +x reducer.py
# First, test the mapper alone
> cat birth_data_sample.txt | ./mapper.py
2010-09 1
2010-10 1
2010-08 1
2010-07 1
2010-10 1
# etc...
# Now pipe the sorted results of the mapper into the reducer step
> cat birth_data_sample.txt | ./mapper.py | sort | ./reducer.py
2010-01:8701
2010-02:8155
2010-03:8976
2010-04:8521
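Once the local run produces sensible counts, the same pair of scripts can be
submitted to the cluster through the Hadoop streaming jar. A typical invocation
looks like the following; the jar location and the HDFS input and output paths are
placeholders that depend on your installation.

# Submit the same scripts as a Hadoop streaming job
# (jar path and HDFS paths are examples only)
> hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /data/VS2010NATL.DETAILUS.PUB \
    -output /results/births_by_month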