This is the library containing the UDF classes; the REGISTER line of the Pig script loads it.
Next, use the Linux cat command to take a look at wordcount2.pig, the updated version of the Pig script, which
employs the newly created UDF:
[hadoop@hc1nn pig]$ cat wordcount2.pig
01 REGISTER /home/hadoop/pig/wcudfs.jar ;
02
03 DEFINE CleanWS wcudfs.CleanWS() ;
04
05 -- get raw line data from file
06
07 rlines = load '/user/hadoop/pig/10031.txt' AS (rline:chararray);
08
09 -- filter out empty lines
10
11 clines = FILTER rlines BY SIZE(rline) > 0 ;
12
13 -- get list of words
14
15 words = foreach clines generate
16 flatten(TOKENIZE(CleanWS( (chararray) $0 ))) as word ;
17
18 -- group the words by word value
19
20 gword = group words by word ;
21
22 -- create a word count
23
24 wcount = foreach gword generate group, COUNT(words) ;
25
26 -- store the word count
27
28 store wcount into '/user/hadoop/pig/wc_result1' ;
There are some new terms in this script. At line 1, the REGISTER keyword is used to register the word-count UDF
library wcudfs.jar for use with this Pig script.
01 REGISTER /home/hadoop/pig/wcudfs.jar ;
Line 3 uses the DEFINE keyword to give a class in this library a short alias, so that it can be called by a single term.
For instance, the class CleanWS in the package wcudfs, in the library wcudfs.jar, can now be called as just CleanWS in
the code.
03 DEFINE CleanWS wcudfs.CleanWS() ;
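To see what the alias saves, compare the two ways of calling the UDF. This is a minimal sketch that assumes the same jar, class, and input file as the script above; the relation names raw, a, and b are hypothetical and used only for illustration.

```pig
REGISTER /home/hadoop/pig/wcudfs.jar ;

raw = load '/user/hadoop/pig/10031.txt' AS (rline:chararray);

-- without DEFINE: the UDF is called by its package-qualified class name
a = foreach raw generate wcudfs.CleanWS(rline) ;

-- with DEFINE: the short alias stands in for the package-qualified name
DEFINE CleanWS wcudfs.CleanWS() ;
b = foreach raw generate CleanWS(rline) ;
```

Both statements should call the same class; the alias only shortens the call, which matters more as the number of UDF calls in a script grows.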
Line 11 introduces the FILTER keyword, which is used here to remove empty lines from the data set. The
relation clines keeps only the lines that have more than zero characters. This is accomplished by checking
the size of the line ( rline ) with the SIZE function and keeping the line only if that size is greater than zero.
11 clines = FILTER rlines BY SIZE(rline) > 0 ;
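One way to check the effect of the filter is to count the rows before and after it. The sketch below assumes the same input file as the script above; the relation names rl_all, cl_all, rl_cnt, and cl_cnt are hypothetical. It uses COUNT_STAR rather than COUNT because COUNT skips rows whose first field is null.

```pig
rlines = load '/user/hadoop/pig/10031.txt' AS (rline:chararray);
clines = FILTER rlines BY SIZE(rline) > 0 ;

-- group each relation into a single bag so that its rows can be counted
rl_all = group rlines all ;
cl_all = group clines all ;

rl_cnt = foreach rl_all generate COUNT_STAR(rlines) ;
cl_cnt = foreach cl_all generate COUNT_STAR(clines) ;

-- the difference between the two counts is the number of lines removed
dump rl_cnt ;
dump cl_cnt ;
```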
 