Processing Data with Map Reduce - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

This is the library containing the UDF classes; the REGISTER line of the Pig script will reference it.

Next, use the Linux cat command to take a look at wordcount2.pig, the updated version of the Pig script, which employs the newly created UDF:

[hadoop@hc1nn pig]$ cat wordcount2.pig

01 REGISTER /home/hadoop/pig/wcudfs.jar ;
02
03 DEFINE CleanWS wcudfs.CleanWS() ;
04
05 -- get raw line data from file
06
07 rlines = load '/user/hadoop/pig/10031.txt' AS (rline:chararray);
08
09 -- filter for empty lines
10
11 clines = FILTER rlines BY SIZE(rline) > 0 ;
12
13 -- get list of words
14
15 words = foreach clines generate
16         flatten(TOKENIZE(CleanWS( (chararray) $0 ))) as word ;
17
18 -- group the words by word value
19
20 gword = group words by word ;
21
22 -- create a word count
23
24 wcount = foreach gword generate group, COUNT(words) ;
25
26 -- store the word count
27
28 store wcount into '/user/hadoop/pig/wc_result1' ;
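
For readers more at home in Java, the data flow of this script can be sketched in plain Java, with no Hadoop or Pig libraries involved. This is only an illustration under stated assumptions: the class WordCountSketch and its methods are hypothetical names, the cleanWs method is a stand-in that merely collapses runs of whitespace (the real CleanWS class lives in wcudfs.jar and may do more), and splitting on a single space is a simplification of what TOKENIZE does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Hypothetical stand-in for the CleanWS UDF (assumption: the real class
    // is not shown here). This version only trims and collapses whitespace.
    static String cleanWs(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Mirrors the script steps: filter empty lines, clean, tokenize, group, count.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .filter(line -> line.length() > 0)                // line 11: FILTER
                .map(WordCountSketch::cleanWs)                    // line 16: CleanWS
                .flatMap(line -> Arrays.stream(line.split(" ")))  // line 16: TOKENIZE + flatten
                .filter(word -> !word.isEmpty())                  // drop tokens from blank lines
                .collect(Collectors.groupingBy(                   // lines 20 and 24: group, COUNT
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat  sat", "", "the mat");
        // Prints {cat=1, mat=1, sat=1, the=2}
        System.out.println(wordCount(lines));
    }
}
```

The sketch runs on an in-memory list, whereas the Pig script reads from and writes to HDFS and is executed as Map Reduce jobs.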

This script introduces some new terms. At line 1, the REGISTER keyword registers the word-count UDF library wcudfs.jar for use with this Pig script.

01 REGISTER /home/hadoop/pig/wcudfs.jar ;

Line 3 uses the DEFINE keyword to assign a single-term alias to a class of the package within this library. For instance, the class CleanWS in the package wcudfs, in the library wcudfs.jar, can now be called as just CleanWS in the code.

03 DEFINE CleanWS wcudfs.CleanWS() ;

Line 11 introduces the FILTER keyword, which is used here to remove empty lines from the data set. The relation clines holds only those lines that have more than zero characters. This is accomplished by checking the size of the line ( rline ) with SIZE and keeping the line only if that size is greater than zero.

11 clines = FILTER rlines BY SIZE(rline) > 0 ;
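
The effect of this size check can be illustrated with a small, self-contained Java sketch. The class FilterSketch and the method nonEmpty are hypothetical names used only for illustration, and the null check is a defensive assumption of the sketch rather than something taken from the script. The point it shows is that a test on character count removes lines of length zero but keeps lines made up only of spaces.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterSketch {

    // Keeps only lines whose character count is greater than zero,
    // analogous to: FILTER rlines BY SIZE(rline) > 0
    static List<String> nonEmpty(List<String> rlines) {
        return rlines.stream()
                .filter(rline -> rline != null && rline.length() > 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rlines = Arrays.asList("first line", "", "   ", "last line");
        // The zero-length line is dropped; the whitespace-only line is kept.
        System.out.println(nonEmpty(rlines).size()); // prints 3
    }
}
```

In this sketch a whitespace-only line passes the filter, so any later step that splits lines into words has to cope with lines that contain no words.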
