Two new commands are exec and run. They run Pig scripts from inside the Grunt
shell and can be useful for debugging those scripts. The exec command executes a Pig
script in a space separate from the Grunt shell: aliases defined in the script aren't
visible to the shell, and vice versa. The run command executes a Pig script in the same
space as Grunt (also known as interactive mode). It has the same effect as manually
typing each line of the script into the Grunt shell.
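As a rough illustration of the difference, the session below assumes a hypothetical script file named myscript.pig that defines an alias called a; neither name comes from the text. It is a sketch of what you would type, not a transcript of real output.

```pig
-- Assumption: myscript.pig is a hypothetical script that defines an alias named a.
grunt> exec myscript.pig
-- exec ran the script in a separate space, so the shell still has no alias a here.
grunt> run myscript.pig
-- run executed the script's statements in the shell's own space, so a is now defined.
grunt> DESCRIBE a;
```

If you want to inspect a script's intermediate aliases while debugging, run is the one that leaves them available in the shell afterward.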
10.4 Learning Pig Latin through Grunt
Before formally describing Pig's data types and data processing operators, let's run a
few commands in the Grunt shell to get a feel for how to process data in Pig. For the
purpose of learning, it's more convenient to run Grunt in local mode:
pig -x local
You may want to first try some of the file commands, such as pwd and ls, to orient
yourself around the filesystem.
Let's look at some data. We'll later reuse the patent data we introduced in
chapter 4, but for now let's dig into an interesting data set of query logs from the
Excite search engine. This data set already comes with the Pig installation, and it's
in the file tutorial/data/excite-small.log under the Pig installation directory.
The data comes in a three-column, tab-separated format. The first column is an
anonymized user ID. The second column is a timestamp in YYMMDDHHMMSS form, and
the third is the search query. A decidedly non-random sample from the 4,500 records
of this file looks like this:
3F8AAC2372F6941C 970916093724 minors in possession
C5460576B58BB1CC 970916194352 hacking telenet
9E1707EE57C96C1E 970916073214 buffalo mob crime family
06878125BE78B42C 970916183900 how to make ecstacy
From within Grunt, enter the following statement to load this data into an “alias” (i.e.,
variable) called log.
grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
Note that nothing seems to have happened after you entered the statement. In the
Grunt shell, Pig parses your statements but doesn't physically execute them until you
use a DUMP or STORE command to ask for the results. The DUMP command prints out
the content of an alias whereas the STORE command stores the content to a file. The
fact that Pig doesn't physically execute any command until you explicitly request
some end result will make sense once you remember that we're processing large data
sets. There's no memory space to “load” the data, and in any case we want to verify
the logic of the execution plan before spending the time and resources to physically
execute it.
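To make the deferred execution concrete, here is a minimal sketch of a session that continues from the LOAD statement above. The alias lmt and the output directory name excite-out are illustrative assumptions, and the LIMIT operator is used only to keep the printed result small.

```pig
-- Parsed only; nothing is read from disk yet.
grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
-- Still only parsed; lmt (a hypothetical alias) holds at most four tuples of log.
grunt> lmt = LIMIT log 4;
-- DUMP asks for a result, so Pig now executes the plan and prints the tuples.
grunt> DUMP lmt;
-- STORE also triggers execution and writes the result under the
-- hypothetical output directory excite-out.
grunt> STORE log INTO 'excite-out';
```

Only the last two statements cause any data to be processed; the first two just extend the plan that Pig will run later.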
We usually use the DUMP command only for development. Most often you'll STORE
significant results into a directory. (Like Hadoop, Pig will automatically partition the
data into files named part-nnnnn.) When you DUMP an alias, you should be sure that
 