Building the code into a jar library using the jar command creates the wordcount1.jar file:
[hadoop@hc1nn wordcount]$ jar -cvf ./wordcount1.jar -C wc_classes .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount.class(in = 1546) (out= 750)(deflated 51%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 648)(deflated 59%)
adding: org/myorg/WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
[hadoop@hc1nn wordcount]$ ls -l *.jar
-rw-rw-r--. 1 hadoop hadoop 3169 Jun 15 15:05 wordcount1.jar
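To confirm that the classes were packaged as expected, the jar command's -tf option lists the archive contents without extracting them (a quick sanity check; the listing should match the entries added above, plus the generated META-INF/MANIFEST.MF):
[hadoop@hc1nn wordcount]$ jar -tf ./wordcount1.jar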
This file can now be used to run a word-count job on Hadoop. As in previous MapReduce runs, the input and
output data for the job will be taken from HDFS. To provide the words to count, I copied some Edgar Allan Poe
texts from the Linux file system into a directory on HDFS. The Linux ls command shows the text files that will
be used:
[hadoop@hc1nn wordcount]$ ls $HOME/edgar
10031.txt 15143.txt 17192.txt 2149.txt 932.txt
Copying these files to the HDFS directory called /user/hadoop/edgar, using the Hadoop file system
copyFromLocal command, sets up the data for the word-count job:
[hadoop@hc1nn wordcount]$ hadoop dfs -copyFromLocal $HOME/edgar/* /user/hadoop/edgar
[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/edgar
Found 5 items
-rw-r--r-- 1 hadoop supergroup 410012 2014-06-15 15:53 /user/hadoop/edgar/10031.txt
-rw-r--r-- 1 hadoop supergroup 559352 2014-06-15 15:53 /user/hadoop/edgar/15143.txt
-rw-r--r-- 1 hadoop supergroup 66401 2014-06-15 15:53 /user/hadoop/edgar/17192.txt
-rw-r--r-- 1 hadoop supergroup 596736 2014-06-15 15:53 /user/hadoop/edgar/2149.txt
-rw-r--r-- 1 hadoop supergroup 63278 2014-06-15 15:53 /user/hadoop/edgar/932.txt
Running the word-count example against the data in the input directory (/user/hadoop/edgar) creates the
results data in the output directory (/user/hadoop/edgar-results). First, though, use the jps command to make
sure that all of the Hadoop processes are up before you run the job:
[hadoop@hc1nn wordcount]$ jps
1959 SecondaryNameNode
1839 DataNode
4166 TaskTracker
4272 Jps
1720 NameNode
4044 JobTracker
This shows that the HDFS processes for the data node and name node are running on hc1nn, and that the
MapReduce JobTracker and TaskTracker processes are running as well. If you are going to rerun this job, you will
need to delete the HDFS-based results directory first by using the Hadoop file system rmr command:
[hadoop@hc1nn wordcount]$ hadoop dfs -rmr /user/hadoop/edgar-results
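With the input data in place and no pre-existing results directory, the job can be submitted via the hadoop jar
command. As a minimal sketch (assuming the driver class is org.myorg.WordCount, the class packaged into the jar
above, and that it takes the input and output directories as its arguments):
[hadoop@hc1nn wordcount]$ hadoop jar ./wordcount1.jar org.myorg.WordCount /user/hadoop/edgar /user/hadoop/edgar-results
When the job completes, the word counts land in part files under /user/hadoop/edgar-results and can be inspected
with the Hadoop file system cat command (with a single reducer, the output file is typically named part-00000):
[hadoop@hc1nn wordcount]$ hadoop dfs -cat /user/hadoop/edgar-results/part-00000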