It lists the files in the results directory on HDFS and dumps the last 10 lines of the results part file using the Hadoop file system cat command and the Linux tail command.
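The inspection step described above can be sketched as a short shell script. The results path matches the -output option of wordcount.sh below, but the part file name (part-00000) is an assumption based on the default streaming output naming:

```shell
#!/bin/bash
# List the word-count results directory on HDFS
# (path taken from the -output option of wordcount.sh).
hadoop fs -ls /user/hadoop/perl/results_wc

# Dump the last 10 lines of the results part file:
# hadoop fs -cat streams the file, and the Linux tail
# command keeps only the final 10 lines.
hadoop fs -cat /user/hadoop/perl/results_wc/part-00000 | tail -10
```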
The script wordcount.sh runs the Map Reduce task by using the Map and Reduce Perl scripts:
[hadoop@hc1nn perl]$ cat wordcount.sh
#!/bin/bash

# Now run the Perl based word count

cd $HADOOP_PREFIX

hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hadoop/perl/mapper.pl \
    -mapper /home/hadoop/perl/mapper.pl \
    -file /home/hadoop/perl/reducer.pl \
    -reducer /home/hadoop/perl/reducer.pl \
    -input /user/hadoop/edgar/* \
    -output /user/hadoop/perl/results_wc
The \ characters make the Hadoop command line more readable by breaking a single command over multiple lines. The -file options ship the named scripts with the job so that they are available on the cluster nodes. The -mapper and -reducer options identify the Map and Reduce functions for the job. The -input option gives the HDFS path to the input text data, and the -output option specifies where the job output will be placed on HDFS. The hadoop jar parameter specifies which library file to use, in this case the streaming library. Using the last three scripts for cleaning, running, and outputting the results makes the Map Reduce task quickly repeatable; you do not need to retype the commands! The output is a Map Reduce job, as shown below:
[hadoop@hc1nn perl]$ ./wordcount.sh
packageJobJar: [/home/hadoop/perl/mapper.pl, /home/hadoop/perl/reducer.pl, /app/hadoop/tmp/hadoop-
unjar5199336797215175827/] [] /tmp/streamjob5502063820605104626.jar tmpDir=null
14/06/20 13:35:56 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/06/20 13:35:56 INFO mapred.FileInputFormat: Total input paths to process : 5
14/06/20 13:35:57 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
14/06/20 13:35:57 INFO streaming.StreamJob: Running job: job_201406201237_0010
14/06/20 13:35:57 INFO streaming.StreamJob: To kill this job, run:
14/06/20 13:35:57 INFO streaming.StreamJob: /usr/local/hadoop-1.2.1/libexec/../bin/hadoop job
-Dmapred.job.tracker=hc1nn:54311 -kill job_201406201237_0010
14/06/20 13:35:57 INFO streaming.StreamJob: Tracking URL:
http://hc1nn:50030/jobdetails.
14/06/20 13:35:58 INFO streaming.StreamJob: map 0% reduce 0%
14/06/20 13:36:06 INFO streaming.StreamJob: map 20% reduce 0%
14/06/20 13:36:08 INFO streaming.StreamJob: map 60% reduce 0%
14/06/20 13:36:13 INFO streaming.StreamJob: map 100% reduce 0%
14/06/20 13:36:15 INFO streaming.StreamJob: map 100% reduce 33%
14/06/20 13:36:19 INFO streaming.StreamJob: map 100% reduce 100%
14/06/20 13:36:22 INFO streaming.StreamJob: Job complete: job_201406201237_0010
14/06/20 13:36:22 INFO streaming.StreamJob: Output: /user/hadoop/perl/results_wc