EXERCISE 9.4
Using Pig to Group and Join Items Based on Some Criteria
Pig is a high-level language for expressing data analysis programs that run as MapReduce
jobs over large data sets on HDFS. Before beginning, please make sure that
Pig ( http://pig.apache.org ) is installed along with Hadoop and Java. A single-node
Hadoop setup is sufficient for this exercise.
Let us assume that we have some stock exchange data that we want to analyze. A sample
of the data in JavaScript Object Notation (JSON) format is given here. Pig supports
a variety of input and output formats, but for this exercise we will use JSON, one of
the most widely used formats in industry.
[{"date":"2009-12-31","dividends":"35.39","symbol":"CLI","exchange":"NYSE"},
{"date":"2009-12-30","dividends":"35.22","symbol":"CLI","exchange":"NYSE"},
{"date":"2009-12-29","dividends":"35.69","symbol":"CLS","exchange":"NYSE"},
{"date":"2009-12-28","dividends":"35.67","symbol":"CLS","exchange":"NYSE"},
{"date":"2009-12-24","dividends":"35.38","symbol":"CGW","exchange":"NYSE"},
{"date":"2009-12-23","dividends":"35.13","symbol":"CGW","exchange":"NYSE"},
{"date":"2009-12-22","dividends":"34.76","symbol":"CWW","exchange":"NYSE"},
{"date":"2009-12-21","dividends":"34.65","symbol":"CWW","exchange":"NYSE"}]
Once we have copied NYSE_dividends.json to HDFS, we need to write a Pig script to
execute over Hadoop. Let us say we want to group this data by symbol and call this script
group.pig. Here, symbol identifies the company's stock.
--group.pig
divs = load 'hdfs://localhost:9000/user/myuser/NYSE_dividends.json'
using JsonLoader('date:chararray, dividends:chararray, symbol:chararray,
exchange:chararray');
grpd = group divs by symbol;
store grpd into 'hdfs://localhost:9000/user/myuser/grouped' using JsonStorage();
Next we run the script. The -x local switch tells Pig to run in local mode on the
single-node Hadoop setup deployed on the local machine. Without the -x switch, Pig
defaults to MapReduce mode and would try to ship the job to the configured distributed
cluster.
$ pig -x local group.pig
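When the job completes, the grouped output is written to the target directory as part files;
each record holds a group key (the symbol) and a bag of the matching input rows. Assuming
the paths used above, the result can be inspected with:
$ hdfs dfs -cat /user/myuser/grouped/part-*
The exercise title also mentions joining items. As a sketch of the same idea, assuming a
second, hypothetical data set NYSE_daily_prices.json with the schema shown below, the
dividend records could be joined to it on symbol:
--join.pig (sketch; the prices file and its schema are assumptions)
divs = load 'hdfs://localhost:9000/user/myuser/NYSE_dividends.json'
using JsonLoader('date:chararray, dividends:chararray, symbol:chararray,
exchange:chararray');
prices = load 'hdfs://localhost:9000/user/myuser/NYSE_daily_prices.json'
using JsonLoader('date:chararray, close:chararray, symbol:chararray,
exchange:chararray');
jnd = join divs by symbol, prices by symbol;
store jnd into 'hdfs://localhost:9000/user/myuser/joined' using JsonStorage();
This script is run the same way: pig -x local join.pig.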