Line 16 calls the user-defined function named CleanWS, which removes unwanted characters from the input text.
15 words = foreach clines generate
16 flatten(TOKENIZE(CleanWS((chararray) $0))) as word;
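The implementation of CleanWS itself is not shown here. As a minimal sketch, assuming that "unwanted characters" means anything other than letters, digits, and spaces, a similar cleanup could be achieved without the custom UDF by using Pig's built-in REPLACE function:

words = foreach clines generate
        flatten(TOKENIZE(REPLACE((chararray) $0, '[^a-zA-Z0-9 ]', ' '))) as word;

Here REPLACE applies the regular expression to each input line before TOKENIZE splits it into words; the pattern is an assumption for illustration, not the actual CleanWS logic.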
I have created some Bash shell scripts to assist in running this second Pig job. They simply show how a manual workflow can be sped up: instead of typing the Pig job execution command each time, I can execute a simple script, and instead of manually deleting the job results directory before a rerun, I can run a clean script. Here is the clean_wc.sh script, displayed with the Linux cat command, which deletes the job's results directory on HDFS:
[hadoop@hc1nn pig]$ cat clean_wc.sh
01 #!/bin/bash
02
03 # remove the pig script results directory
04
05 hadoop dfs -rmr /user/hadoop/pig/wc_result1
The script does this by calling the Hadoop file system rmr command to remove the directory and its contents.
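Note that hadoop dfs -rmr is the older, Hadoop 1.x command syntax. Assuming a Hadoop 2.x or later release, where that form is deprecated, the equivalent would be:

hdfs dfs -rm -r /user/hadoop/pig/wc_result1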
The next script, run_wc2.sh, which is used to run the job, calls the clean script (at line 5) each time it is run.
This single script cleans the results directory on HDFS and runs the wordcount2.pig job:
[hadoop@hc1nn pig]$ cat run_wc2.sh
01 #!/bin/bash
02
03 # run the pig wc 2 job
04
05 ./clean_wc.sh
06
07 pig -stop_on_failure wordcount2.pig
This shell script calls the clean_wc.sh script and then invokes the Pig wordcount2.pig script. The pig command
on line 7 is called with a flag (-stop_on_failure), telling it to stop as soon as it encounters an error. The results are
listed via the result_wc.sh script:
[hadoop@hc1nn pig]$ cat result_wc.sh
01 #!/bin/bash
02
03 # list the pig script results directory
04
05 hadoop dfs -ls /user/hadoop/pig/wc_result1
06
07
08 hadoop dfs -cat /user/hadoop/pig/wc_result1/part-r-00000 | tail -10
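Assuming the three scripts have been made executable (for instance, with chmod +x *.sh), a full rerun and results check then takes just two commands:

[hadoop@hc1nn pig]$ ./run_wc2.sh
[hadoop@hc1nn pig]$ ./result_wc.sh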
 