Lastly, you can store the word-count results, currently held in the relation E , on HDFS in /user/hadoop/pig/wc_result:
grunt> store E into '/user/hadoop/pig/wc_result' ; -- store the results
grunt> quit ; -- quit interactive session
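One point to note: the store statement fails if its output directory already exists on HDFS, so if you rerun the session you must delete the previous result first. A minimal sketch, using the same Hadoop file system shell as the other examples here (this command needs a running cluster):

```shell
# remove a previous word-count result so a rerun of the store statement can succeed
hadoop dfs -rmr /user/hadoop/pig/wc_result
```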
Having quit the Pig interactive session, you can examine the results of this Pig job on HDFS. The Hadoop file
system ls command shows a success file (_SUCCESS), a part file (part-r-00000) containing the word-count data, and a
logs directory (_logs). You can then list the part file with the Hadoop file system cat command, piping its output
through the Linux tail command to view the last 10 lines of the file. Both commands are shown here:
[hadoop@hc1nn edgar]$ hadoop dfs -ls /user/hadoop/pig/wc_result
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2014-06-18 13:08 /user/hadoop/pig/wc_result/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2014-06-18 13:08 /user/hadoop/pig/wc_result/_logs
-rw-r--r-- 1 hadoop supergroup 137870 2014-06-18 13:08 /user/hadoop/pig/wc_result/part-r-00000
[hadoop@hc1nn edgar]$ hadoop dfs -cat /user/hadoop/pig/wc_result/part-r-00000 | tail -10
1 http://gutenberg.net/license
1 Dream'--Prospero--Oberon--and
1 http://pglaf.org/fundraising .
1 it!--listen--now--listen!--the
1 http://www.gutenberg.net/GUTINDEX.ALL
1 http://www.gutenberg.net/1/0/2/3/10234
1 http://www.gutenberg.net/2/4/6/8/24689
1 http://www.gutenberg.net/1/0/0/3/10031/
1 http://www.ibiblio.org/gutenberg/etext06
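As the listing shows, the counts in the part file are unsorted. If you wanted the most frequent words first, you could add an order step before the store; a minimal sketch, assuming (as in the output above) that the count is the first field of each tuple in E , and using a hypothetical output path wc_sorted:

```pig
F = order E by $0 desc ; -- sort tuples on the count field, highest first
store F into '/user/hadoop/pig/wc_sorted' ; -- example output path
```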
It is quite impressive that, with five lines of Pig commands (ignoring the dump and quit lines), you can run the
same word-count algorithm that took 70 lines of Java code. Less code means lower development costs and, we all hope,
fewer code-based errors.
While efficient, the interactive Pig example does have a drawback: the commands must be typed manually each
time you want to run a word count, and once you quit the session, they are lost. The answer to this problem, of course,
is to store the Pig script in a file and run it as a batch MapReduce job. To demonstrate, I placed the Pig commands
from the previous example into the wordcount.pig file:
[hadoop@hc1nn pig]$ ls -l
total 4
-rw-rw-r--. 1 hadoop hadoop 313 Jun 18 13:24 wordcount.pig
[hadoop@hc1nn pig]$ cat wordcount.pig
-- get raw line data from file

rlines = load '/user/hadoop/pig/10031.txt';

-- get list of words

words = foreach rlines generate flatten(TOKENIZE((chararray)$0)) as word;

-- group the words by word value