28 if ( $word eq $oldword )
29 {
30 $sumval += $value ;
31 }
32 else
33 {
34 if ( $oldword ne "" )
35 {
36 print "$oldword,$sumval\n" ;
37 }
38 $sumval = 1 ;
39 }
40
41 # now print the name value pairs
42
43 $oldword = $word ;
44 }
45
46 # remember to print last word
47
48 print "$oldword,$sumval\n" ;
The reducer.pl Perl script receives data from the mapper.pl script and splits each STDIN (standard input) line into
a word,1 key-value pair (at line 21). It then groups identical words and accumulates their counts between lines 28 and
39. Finally, it outputs key-value pairs in the form word,count at lines 36 and 48.
You already have some basic text files on HDFS under the directory /user/hadoop/edgar on which you can run the
Perl word-count example. Check the data using the Hadoop file system ls command to be sure that it is ready to use:
[hadoop@hc1nn python]$ hadoop dfs -ls /user/hadoop/edgar
Found 5 items
-rw-r--r-- 1 hadoop supergroup 410012 2014-06-15 15:53 /user/hadoop/edgar/10031.txt
-rw-r--r-- 1 hadoop supergroup 559352 2014-06-15 15:53 /user/hadoop/edgar/15143.txt
-rw-r--r-- 1 hadoop supergroup 66401 2014-06-15 15:53 /user/hadoop/edgar/17192.txt
-rw-r--r-- 1 hadoop supergroup 596736 2014-06-15 15:53 /user/hadoop/edgar/2149.txt
-rw-r--r-- 1 hadoop supergroup 63278 2014-06-15 15:53 /user/hadoop/edgar/932.txt
The test1.sh shell script tests the Map function on the Linux command line to ensure that it works, emitting a count
of 1 for each word in the input string:
[hadoop@hc1nn perl]$ cat test1.sh
01 #!/bin/bash
02
03 # test the mapper
04
05 echo "one one one two three" | ./mapper.pl
[hadoop@hc1nn perl]$ ./test1.sh
one,1
one,1
one,1
two,1
three,1