Line 16 calls the user-defined function named CleanWS, which removes unwanted characters from the input text.
15 words = foreach clines generate
16 flatten(TOKENIZE(CleanWS((chararray) $0))) as word;
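The implementation of CleanWS itself is not shown here. As a minimal sketch, assuming that "unwanted characters" means anything other than letters, digits, and spaces, a similar cleanup could be achieved without the custom UDF by using Pig's built-in REPLACE function:

words = foreach clines generate
        flatten(TOKENIZE(REPLACE((chararray) $0, '[^a-zA-Z0-9 ]', ' '))) as word;

Here REPLACE applies the regular expression to each input line before TOKENIZE splits it into words; the pattern is an assumption for illustration, not the actual CleanWS logic.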
I have created some Bash shell scripts to assist in running this second Pig job. They simply show how a manual workflow can be sped up: instead of typing the Pig job execution command each time, I can execute a simple script, and instead of manually deleting the job results directory before a rerun, I can run a clean script. Here is the clean_wc.sh script, displayed with the Linux cat command, which deletes the job's results directory on HDFS:
[hadoop@hc1nn pig]$ cat clean_wc.sh
01 #!/bin/bash
02
03 # remove the pig script results directory
04
05 hadoop dfs -rmr /user/hadoop/pig/wc_result1
The script does this by calling the Hadoop file system rmr command to remove the directory and its contents.
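Note that hadoop dfs -rmr is the older, Hadoop 1.x command syntax. Assuming a Hadoop 2.x or later release, where that form is deprecated, the equivalent would be:

hdfs dfs -rm -r /user/hadoop/pig/wc_result1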
The next script, run_wc2.sh, which is used to run the job, calls the clean script (at line 5) each time it is run.
This single script cleans the results directory on HDFS and runs the wordcount2.pig job:
[hadoop@hc1nn pig]$ cat run_wc2.sh
01 #!/bin/bash
02
03 # run the pig wc 2 job
04
05 ./clean_wc.sh
06
07 pig -stop_on_failure wordcount2.pig
This shell script calls the clean_wc.sh script and then invokes the Pig wordcount2.pig script. The pig command
on line 7 is called with a flag (-stop_on_failure), telling it to stop as soon as it encounters an error. The results are
listed via the result_wc.sh script:
[hadoop@hc1nn pig]$ cat result_wc.sh
01 #!/bin/bash
02
03 # list the pig script results directory
04
05 hadoop dfs -ls /user/hadoop/pig/wc_result1
06
07
08 hadoop dfs -cat /user/hadoop/pig/wc_result1/part-r-00000 | tail -10
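Assuming the three scripts have been made executable (for instance, with chmod +x *.sh), a full rerun and results check then takes just two commands:

[hadoop@hc1nn pig]$ ./run_wc2.sh
[hadoop@hc1nn pig]$ ./result_wc.sh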
 