Okay, that works. The input of five space-separated words is output as five key-value pairs, one pair per word, each with a value of 1. Now, you test the Reduce function with test2.sh:
[hadoop@hc1nn perl]$ cat test2.sh
#!/bin/bash

# test the map and reduce functions together

echo "one one one two three" | ./mapper.pl | ./reducer.pl
This script pipes the output from the Map function shown above into the Reduce function:
[hadoop@hc1nn perl]$ ./test2.sh
one,3
two,1
three,1
The Reduce function correctly sums the values for matching words: three instances of the word one, followed by one each of two and three.
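For reference, minimal versions of mapper.pl and reducer.pl that would produce this behavior might look like the following sketches. They are illustrative only, not necessarily the scripts used earlier in the chapter; they use a comma as the key/value separator to match the output shown above:

#!/usr/bin/perl
# mapper.pl (sketch) -- emit "word,1" for every word read from STDIN
use strict;
use warnings;

while ( my $line = <STDIN> )
{
  chomp $line;
  foreach my $word ( split /\s+/, $line )
  {
    next if $word eq '';
    print "$word,1\n";
  }
}

#!/usr/bin/perl
# reducer.pl (sketch) -- sum the counts of consecutive identical keys
use strict;
use warnings;

my $current;
my $total = 0;

while ( my $line = <STDIN> )
{
  chomp $line;
  my ( $word, $count ) = split /,/, $line;
  next unless defined $count;
  if ( defined $current && $word eq $current )
  {
    $total += $count;
  }
  else
  {
    print "$current,$total\n" if defined $current;
    $current = $word;
    $total   = $count;
  }
}
print "$current,$total\n" if defined $current;

Note that a reducer written this way relies on identical keys arriving on consecutive lines; Hadoop's sort phase guarantees this in a real job, and the simple test input above happens to satisfy it as well.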
Now it is time to run the Hadoop streaming Map Reduce job using these Perl scripts. You create three scripts to help with this:
[hadoop@hc1nn perl]$ ls w*
wc_clean.sh wc_output.sh wordcount.sh
The script wc_clean.sh deletes the results directory on HDFS so that the Map Reduce job
can be rerun:
[hadoop@hc1nn perl]$ cat wc_clean.sh
#!/bin/bash

# Clean the hadoop perl run data directory

hadoop dfs -rmr /user/hadoop/perl/results_wc
This uses the Hadoop file system rmr command to delete the directory and its contents.
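Note that hadoop dfs is the older form of this command; on more recent Hadoop releases it is deprecated, and the equivalent would be:

hdfs dfs -rm -r /user/hadoop/perl/results_wc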
The script wc_output.sh is used to display the results of the job:
[hadoop@hc1nn perl]$ cat wc_output.sh
#!/bin/bash

# List the results directory

hadoop dfs -ls /user/hadoop/perl/results_wc

# Cat the last ten lines of the part file

hadoop dfs -cat /user/hadoop/perl/results_wc/part-00000 | tail -10
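The third script, wordcount.sh, runs the word-count job itself through the Hadoop streaming jar. Its exact contents are not shown here; as a rough sketch (the streaming jar location and the HDFS input directory below are assumptions that depend on your installation), it would look something like this:

#!/bin/bash

# Run the word-count job via Hadoop streaming (sketch only).
# The streaming jar path and the -input directory are assumptions;
# adjust them to match your Hadoop installation.

STREAM_JAR=$HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar

hadoop jar $STREAM_JAR \
 -input /user/hadoop/perl/data \
 -output /user/hadoop/perl/results_wc \
 -mapper mapper.pl \
 -reducer reducer.pl \
 -file ./mapper.pl \
 -file ./reducer.pl

The -output directory matches the one that wc_clean.sh removes and wc_output.sh lists, so a typical cycle is to run wc_clean.sh, then wordcount.sh, and finally wc_output.sh to inspect the results.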
 