sample tweets file from the attachment folder (datafiles). This file contains the tweet date, the screen name, and the tweet body, delimited by '\001' (see Figure 6-11).
Figure 6-11. The sample tweet file delimited by '\001'
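To make the record layout concrete, here is a minimal Java sketch of how one line of such a file splits into its three fields. The class name TweetLineParser and the sample line are hypothetical and made up for illustration; the only thing taken from the text is the '\001' delimiter and the field order (date, screen name, body).

```java
public class TweetLineParser {
    // Field separator used by the sample file: the control character U+0001 (Ctrl-A).
    static final String DELIMITER = "\u0001";

    /** Splits one line of the file into its fields: date, screen name, body. */
    static String[] parse(String line) {
        // A limit of -1 keeps trailing empty fields instead of silently dropping them.
        return line.split(DELIMITER, -1);
    }

    public static void main(String[] args) {
        // Made-up record for illustration only; the real file's date format may differ.
        String line = String.join(DELIMITER, "2013-05-01", "The News Selector", "Hello world");
        String[] fields = parse(line);
        System.out.println("date        = " + fields[0]);
        System.out.println("screen_name = " + fields[1]);
        System.out.println("body        = " + fields[2]);
    }
}
```

Because the separator is a non-printing character, the fields usually look run together in a text editor, which is why the delimiter has to be given explicitly when the file is loaded.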
Counting Tweets
In this example we will demonstrate running various Pig commands using the interactive Grunt shell. For logic of medium complexity or more, we may prefer to save the statements in a file and run them as a Pig script. The command to run a Pig script is as follows:
pig -x local myscript.pig
Here myscript.pig is a plain-text file containing Pig Latin statements; it does not need to be compiled. We can also run Pig statements from a Java program in embedded mode as follows:
# Compile to a .class file
javac -cp pig.jar MyScript.java
# Run the Pig statements as a Java program in embedded mode
java -cp pig.jar:. MyScript
In this exercise, we will use Apache Pig to run MapReduce jobs that compute the total tweet count and the number of tweets for a specific screen_name.
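Before turning to Pig, it may help to see the two results the exercise is after, written as plain Java over a handful of in-memory lines. This is only a sketch of the intended outcome, not how Pig computes it: the class name TweetCountSketch, its method names, and the sample records are hypothetical, and the records are assumed to use the '\001' delimiter with the screen name as the second field.

```java
import java.util.Arrays;
import java.util.List;

public class TweetCountSketch {
    // Same separator as the sample file: the control character U+0001.
    static final String DELIMITER = "\u0001";

    /** Total tweet count: one tweet per line. */
    static long countAll(List<String> lines) {
        return lines.size();
    }

    /** Number of tweets whose second field (screen name) equals the given name. */
    static long countFor(List<String> lines, String screenName) {
        long count = 0;
        for (String line : lines) {
            String[] fields = line.split(DELIMITER, -1);
            if (fields.length >= 2 && fields[1].equals(screenName)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up records for illustration only.
        List<String> lines = Arrays.asList(
            String.join(DELIMITER, "2013-05-01", "The News Selector", "first tweet"),
            String.join(DELIMITER, "2013-05-01", "someone_else", "second tweet"),
            String.join(DELIMITER, "2013-05-02", "The News Selector", "third tweet"));
        System.out.println("total tweets: " + countAll(lines));
        System.out.println("tweets by The News Selector: " + countFor(lines, "The News Selector"));
    }
}
```

The Pig version expresses the same two questions declaratively, which lets the framework distribute the work across the cluster instead of looping over lines in a single process.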
1. First, load the tweets using PigStorage:
tweets = LOAD '/home/vivek/tweets' USING PigStorage('\u0001') AS
(date:chararray, screen_name:chararray, body:chararray);
2. Next, filter the tweets for the screen name The News Selector.
name = FILTER tweets by screen_name matches 'The News Selector';