Programming with Pig - Hadoop in Action


Two new commands are exec and run. Both run Pig scripts from inside the Grunt shell and can be useful for debugging them. The exec command executes a Pig script in a space separate from the Grunt shell: aliases defined in the script aren't visible to the shell, and vice versa. The run command executes a Pig script in the same space as Grunt (also known as interactive mode). It has the same effect as manually typing each line of the script into the Grunt shell.
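The difference in alias visibility can be sketched with a short Grunt session. This is only an illustration: the script name myscript.pig is hypothetical, and it is assumed to contain a single statement that defines an alias named log.

```pig
-- Contents of the hypothetical script myscript.pig:
--   log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

grunt> exec myscript.pig
-- The script ran in its own space, so the shell doesn't know the alias.
-- This DESCRIBE is expected to fail with an undefined-alias error.
grunt> DESCRIBE log;

grunt> run myscript.pig
-- The script ran in the shell's space, as if typed in line by line.
-- This DESCRIBE should now succeed and print the schema of log.
grunt> DESCRIBE log;
```

A reasonable rule of thumb is to use exec when you want a script to leave your interactive session untouched, and run when you want to keep working with the aliases the script defines.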

10.4 Learning Pig Latin through Grunt

Before formally describing Pig's data types and data processing operators, let's run a few commands in the Grunt shell to get a feel for how to process data in Pig. For learning purposes, it's more convenient to run Grunt in local mode:

pig -x local

You may want to first try some of the file commands, such as pwd and ls, to orient yourself in the filesystem.
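As a small sketch, assuming Grunt was started from the Pig installation directory (an assumption, since your working directory may differ), the session might begin like this:

```pig
-- Show the current working directory
grunt> pwd

-- List its contents
grunt> ls

-- List the tutorial data directory that holds the sample log
grunt> ls tutorial/data
```

If the last command reports that the path doesn't exist, you are probably not in the Pig installation directory, and the relative paths used later will need adjusting.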

Let's look at some data. We'll later reuse the patent data we introduced in chapter 4, but for now let's dig into an interesting data set of query logs from the Excite search engine. This data set comes with the Pig installation, in the file tutorial/data/excite-small.log under the Pig installation directory.

The data comes in a three-column, tab-separated format. The first column is an anonymized user ID. The second column is a timestamp in YYMMDDHHMMSS form, and the third is the search query. A decidedly non-random sample from the 4,500 records of this file looks like

3F8AAC2372F6941C 970916093724 minors in possession

C5460576B58BB1CC 970916194352 hacking telenet

9E1707EE57C96C1E 970916073214 buffalo mob crime family

06878125BE78B42C 970916183900 how to make ecstacy

From within Grunt, enter the following statement to load this data into an “alias” (i.e., variable) called log:

grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
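The statement above relies on Pig's defaults: tab-delimited input and untyped fields. A more explicit variant, offered only as a sketch, names the delimiter and declares field types. The type choices here (chararray, long) are assumptions made for illustration, not something the text prescribes.

```pig
-- Same load, with the tab delimiter and field types spelled out.
-- The declared types are illustrative assumptions.
grunt> log = LOAD 'tutorial/data/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:long, query:chararray);

-- DESCRIBE prints the schema of an alias without running a job.
grunt> DESCRIBE log;
```

DESCRIBE is a convenient way to confirm that Pig has registered the alias even though no data has been read yet.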

Note that nothing seems to have happened after you entered the statement. In the Grunt shell, Pig parses your statements but doesn't physically execute them until you use a DUMP or STORE command to ask for the results. The DUMP command prints out the content of an alias, whereas the STORE command stores the content to a file. The fact that Pig doesn't physically execute any command until you explicitly request some end result makes sense once you remember that we're processing large data sets. There's no memory space to “load” the data, and in any case we want to verify the logic of the execution plan before spending the time and resources to physically execute it.
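To see deferred execution in practice, the following sketch asks for results in both ways. It assumes the alias log has been defined as above. The alias name lmt and the output directory excite-copy are hypothetical names chosen for this illustration.

```pig
-- Nothing has executed so far. LIMIT defines another alias and still runs nothing.
grunt> lmt = LIMIT log 4;

-- DUMP triggers execution and prints the four tuples to the screen.
grunt> DUMP lmt;

-- STORE triggers execution and writes the full data set to a directory.
-- The directory 'excite-copy' is a hypothetical name and must not already exist.
grunt> STORE log INTO 'excite-copy';
```

Limiting an alias before dumping it keeps the screen output manageable while you are still checking your logic.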

We usually use the DUMP command only for development. Most often you'll STORE significant results into a directory. (Like Hadoop, Pig will automatically partition the data into files named part-nnnnn.) When you DUMP an alias, you should be sure that
