Processing Data with Map Reduce - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset


Now you are ready to run this extended version of the Java Map Reduce task. The library that was just created is passed to the hadoop jar command, followed by the name of the class to be called within that library. Next, the -D option sets the wordcount.case.sensitive property to false, which switches case sensitivity off. After that come the input data file and the output directory on HDFS. Finally, the -skip option names a patterns file that lists the unwanted characters to be removed from the data as it is processed:

[hadoop@hc1nn wordcount]$ hadoop jar ./wordcount1.jar org.myorg.WordCount \
    -Dwordcount.case.sensitive=false /user/hadoop/edgar/10031.txt \
    /user/hadoop/edgar-results -skip /user/hadoop/java/patterns.txt
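To make the effect of the case-sensitivity flag and the skip file concrete, here is a minimal Hadoop-free sketch of what such a mapper might do to a single line of input. It is not the code inside wordcount1.jar: the class name WordCountSketch, the method countWords, and the choice to treat each skip entry as a regular expression are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative sketch only: these names are hypothetical and do not come from wordcount1.jar.
class WordCountSketch {

    // Counts the whitespace-separated tokens in one line, after optionally
    // lower-casing it and deleting every match of each skip pattern.
    static Map<String, Integer> countWords(String line, boolean caseSensitive,
                                           List<String> skipPatterns) {
        String text = caseSensitive ? line : line.toLowerCase();
        for (String pattern : skipPatterns) {
            // Assumption: each skip entry is a regular expression.
            text = text.replaceAll(pattern, "");
        }
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> skip = Arrays.asList(",", "!");
        // Case sensitivity off: "The" and "the" are counted as the same word.
        System.out.println(countWords("The raven, the Raven!", false, skip));
        // prints {raven=2, the=2}
    }
}
```

With the flag left at true, the same line would yield four distinct keys (Raven, The, raven, the), which is why switching case sensitivity off tends to reduce the number of output records.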

The command produces the following Map Reduce task output:

14/06/21 17:40:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library

14/06/21 17:40:06 INFO mapred.FileInputFormat: Total input paths to process : 1

14/06/21 17:40:07 INFO mapred.JobClient: Running job: job_201406211041_0004

14/06/21 17:40:08 INFO mapred.JobClient: map 0% reduce 0%

14/06/21 17:40:15 INFO mapred.JobClient: map 50% reduce 0%

14/06/21 17:40:23 INFO mapred.JobClient: map 100% reduce 16%

14/06/21 17:40:30 INFO mapred.JobClient: map 100% reduce 100%

14/06/21 17:40:31 INFO mapred.JobClient: Job complete: job_201406211041_0004

14/06/21 17:40:31 INFO mapred.JobClient: Counters: 32

14/06/21 17:40:31 INFO mapred.JobClient: Job Counters

14/06/21 17:40:31 INFO mapred.JobClient: Launched reduce tasks=1

14/06/21 17:40:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=17198

14/06/21 17:40:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

............................

14/06/21 17:40:31 INFO mapred.JobClient: CPU time spent (ms)=5880

14/06/21 17:40:31 INFO mapred.JobClient: Map input bytes=410012

14/06/21 17:40:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=198

14/06/21 17:40:31 INFO mapred.JobClient: Combine input records=63590

14/06/21 17:40:31 INFO mapred.JobClient: Reduce input records=12581

14/06/21 17:40:31 INFO mapred.JobClient: Reduce input groups=9941

14/06/21 17:40:31 INFO mapred.JobClient: Combine output records=12581

14/06/21 17:40:31 INFO mapred.JobClient: Physical memory (bytes) snapshot=404115456

14/06/21 17:40:31 INFO mapred.JobClient: Reduce output records=9941

14/06/21 17:40:31 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4109373440

14/06/21 17:40:31 INFO mapred.JobClient: Map output records=63590
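The record counters in this output can be cross-checked against one another. In this run, Map output records equals Combine input records (63590), Combine output records equals Reduce input records (12581), and Reduce input groups equals Reduce output records (9941). Reduce input records exceeds Reduce input groups, which is what one would expect if more than one map task ran its own combiner, since the same word can then arrive from several maps. The short sketch below works out how much the combiner shrank the map output; the class and method names are hypothetical, and the figures are copied from the counters above.

```java
import java.util.Locale;

// Illustrative arithmetic only: the class name is hypothetical,
// and the values are copied from the job counters listed above.
class CounterCheck {

    // Percentage of records removed between an input and an output count.
    static double reductionPercent(long recordsIn, long recordsOut) {
        return 100.0 * (recordsIn - recordsOut) / recordsIn;
    }

    public static void main(String[] args) {
        long mapOutputRecords = 63590;     // also "Combine input records"
        long combineOutputRecords = 12581; // also "Reduce input records"
        long reduceOutputRecords = 9941;   // also "Reduce input groups"

        System.out.println(String.format(Locale.ROOT,
                "Combiner removed %.1f%% of the map output records",
                reductionPercent(mapOutputRecords, combineOutputRecords)));
        // prints: Combiner removed 80.2% of the map output records

        System.out.println("Distinct words written by the reducer: " + reduceOutputRecords);
    }
}
```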

Check the results directory on HDFS with the Hadoop file system ls command. The presence of a _SUCCESS file shows that the job completed successfully; the word counts themselves are in the part-00000 file:

[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/edgar-results

Found 3 items

-rw-r--r-- 1 hadoop supergroup 0 2014-06-21 17:40 /user/hadoop/edgar-results/_SUCCESS

drwxr-xr-x - hadoop supergroup 0 2014-06-21 17:40 /user/hadoop/edgar-results/_logs

-rw-r--r-- 1 hadoop supergroup 103300 2014-06-21 17:40 /user/hadoop/edgar-results/part-00000
