You can run the job via the Hadoop jar command. The parameters passed to it are the library file you have just
created, the name of the class to run in that library, the input directory on HDFS, and the output directory:
[hadoop@hc1nn wordcount]$ hadoop jar ./wordcount1.jar org.myorg.WordCount /user/hadoop/edgar /user/hadoop/edgar-results
14/06/15 16:04:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/06/15 16:04:50 INFO mapred.FileInputFormat: Total input paths to process : 5
14/06/15 16:04:51 INFO mapred.JobClient: Running job: job_201406151602_0001
14/06/15 16:04:52 INFO mapred.JobClient: map 0% reduce 0%
14/06/15 16:05:02 INFO mapred.JobClient: map 20% reduce 0%
14/06/15 16:05:03 INFO mapred.JobClient: map 40% reduce 0%
14/06/15 16:05:04 INFO mapred.JobClient: map 60% reduce 0%
........................
14/06/15 16:05:19 INFO mapred.JobClient: Combine input records=284829
14/06/15 16:05:19 INFO mapred.JobClient: Reduce input records=55496
14/06/15 16:05:19 INFO mapred.JobClient: Reduce input groups=36348
14/06/15 16:05:19 INFO mapred.JobClient: Combine output records=55496
14/06/15 16:05:19 INFO mapred.JobClient: Physical memory (bytes) snapshot=912035840
14/06/15 16:05:19 INFO mapred.JobClient: Reduce output records=36348
14/06/15 16:05:19 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7949012992
14/06/15 16:05:19 INFO mapred.JobClient: Map output records=284829
The job has completed (the output shown above has been trimmed), so you can check the output on HDFS
under /user/hadoop/edgar-results/ by using the Hadoop file system ls command:
[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/edgar-results/
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2014-06-15 16:05 /user/hadoop/edgar-results/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2014-06-15 16:04 /user/hadoop/edgar-results/_logs
-rw-r--r-- 1 hadoop supergroup 396500 2014-06-15 16:05 /user/hadoop/edgar-results/part-00000
These results include a _SUCCESS file, so the job completed without error. As in previous examples, you use the
Hadoop file system cat command to dump the contents of the results file, piping it through the Linux head command to
limit the output to the first 10 rows:
[hadoop@hc1nn wordcount]$ hadoop dfs -cat /user/hadoop/edgar-results/part-00000 | head -10
!) 1
"''T 1
"'And 1
"'As 1
"'Be 2
"'But--still--monsieur----' 1
"'Catherine, 1
"'Comb 1
"'Come 1
"'Eyes,' 1
Well done! You have just compiled and run your own native Map Reduce job from a source file. To create more,
you can simply change the Java algorithm (or write your own) and follow the same process. One change that might
be useful is to ignore white-space and symbol characters when counting the words; as the output above shows, the
results contain tokens polluted with characters such as " and -. The next example adds these refinements, along the
lines of the sketch below.
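As a rough illustration of that refinement, the mapper could strip non-alphanumeric characters from each token before emitting it. The sketch below assumes the classic WordCount mapper structure from the Hadoop tutorial, using the same old mapred API that the job output above reports; the class name CleanWordMapper and the replaceAll pattern are illustrative choices here, not the exact code of the next example:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CleanWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      // Drop anything that is not a letter or digit, so tokens such as
      // "'Come or !) from the results above collapse to plain words.
      String cleaned = tokenizer.nextToken().replaceAll("[^a-zA-Z0-9]", "");
      if (!cleaned.isEmpty()) {
        word.set(cleaned);
        output.collect(word, one);
      }
    }
  }
}

Wiring this in is simply a matter of pointing the job configuration's setMapperClass at this class instead of the original mapper; adding a toLowerCase() call to the cleaned token would additionally merge case variants of the same word.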
 