Programming with Pig - Hadoop in Action


the AS option after each field. This syntax differs from the LOAD command

where the schema is specified as a list after the AS option, but in both cases we

use AS to specify a schema.

Table 10.8 summarizes the relational operators in Pig Latin. Many operators accept an optional PARALLEL n clause, where n is the degree of parallelism you want for executing that operator. In practice, n is the number of reduce tasks in Hadoop that Pig will use. If you don't set n, Pig falls back to the default setting of your Hadoop cluster. The Pig documentation recommends setting the value of n according to the following guideline:

n = (#nodes - 1) * 0.45 * RAM

where #nodes is the number of nodes and RAM is the amount of memory in GB on

each node.
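As a rough worked example of the guideline above, the following sketch plugs in a hypothetical cluster of 10 nodes with 4 GB of RAM each. The function name recommended_parallel, the rounding down to a whole number, the lower bound of 1, and the alias names in the generated statement are all assumptions made for illustration; they are not part of Pig.

```python
def recommended_parallel(num_nodes, ram_gb):
    """Apply n = (#nodes - 1) * 0.45 * RAM and return a whole number of reduce tasks."""
    n = (num_nodes - 1) * 0.45 * ram_gb
    # Assumption: round down, and never go below one reduce task.
    return max(1, int(n))


# Hypothetical cluster: 10 nodes, 4 GB of RAM on each node.
n = recommended_parallel(10, 4)

# Build a Pig Latin statement that passes the value to PARALLEL.
# "unique" and "records" are made-up alias names.
statement = f"unique = DISTINCT records PARALLEL {n};"
print(statement)
```

Here (10 - 1) * 0.45 * 4 = 16.2, which the sketch rounds down to 16 reduce tasks.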

Table 10.8 Relational operators in Pig Latin

SPLIT

SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...];

Splits a relation into two or more relations, based on the given Boolean

expressions. Note that a tuple can be assigned to more than one relation, or to

none at all.

UNION

alias = UNION alias, alias [, alias ...];

Creates the union of two or more relations. Note that UNION:

■ Makes no guarantee about the order of tuples, as with any relation

■ Doesn't require the relations to have the same schema or even the same number of fields

■ Doesn't remove duplicate tuples

FILTER

alias = FILTER alias BY expression;

Selects tuples based on a Boolean expression. Use it to keep the tuples you want or to remove the tuples you don't want.

DISTINCT

alias = DISTINCT alias [PARALLEL n];

Removes duplicate tuples.

SAMPLE

alias = SAMPLE alias factor;

Randomly samples a relation. The sampling factor is given in factor. For example, a 1% sample of data in relation large_data is

small_data = SAMPLE large_data 0.01;

The operation is probabilistic, so the size of small_data will not be exactly 1% of large_data, and there's no guarantee the operation will return the same number of tuples each time.

FOREACH

alias = FOREACH alias GENERATE expression [, expression ...] [AS schema];

Loops through each tuple and generates new tuple(s). Usually applied to transform columns of data, such as adding or deleting fields.

You can optionally specify a schema for the output relation, for example to name new fields.
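Some of the notes in Table 10.8 are easier to see on a tiny data set. The sketch below models the described behavior of SPLIT, UNION, DISTINCT, and SAMPLE on in-memory Python lists. It is not Pig code and not how Pig implements these operators. The function names, the sample data, and the choice to model SAMPLE as an independent coin flip per tuple are assumptions made for illustration.

```python
import random


def split(relation, *conditions):
    # One output relation per condition. Each tuple is tested against every
    # condition, so it can land in several outputs or in none.
    return [[t for t in relation if cond(t)] for cond in conditions]


def union(*relations):
    # Plain concatenation: duplicates are kept and field counts are not checked.
    return [t for rel in relations for t in rel]


def distinct(relation):
    # Drop duplicate tuples; the order of the result carries no meaning.
    return list(set(relation))


def sample(relation, factor, rng):
    # Keep each tuple independently with probability `factor`, so the output
    # size only approximates factor * len(relation).
    return [t for t in relation if rng.random() < factor]


data = [(1, "a"), (2, "b"), (3, "c"), (12, "d")]

# (2, "b") satisfies both conditions; (3, "c") satisfies neither.
small, even = split(data, lambda t: t[0] < 3, lambda t: t[0] % 2 == 0)

combined = union(small, even)        # (2, "b") appears twice
mixed = union(small, [(7, "x", 9)])  # differing field counts are accepted
unique = distinct(combined)          # the duplicate (2, "b") is removed

# A 1% sample of 1,000 tuples; the size varies from run to run.
picked = sample(data * 250, 0.01, random.Random())
print(len(combined), len(unique), len(picked))
```

Running the sketch several times prints the same first two numbers but usually a different third one, which mirrors the note that SAMPLE does not return the same number of tuples each time.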
