Pig - Hadoop: The Definitive Guide

Pig in Practice

A few practical techniques are worth knowing when you are developing and running Pig programs. This section covers some of them.

Parallelism

When running in MapReduce mode, it's important that the degree of parallelism matches the size of the dataset. By default, Pig sets the number of reducers by looking at the size of the input and using one reducer per 1 GB of input, up to a maximum of 999 reducers. You can override these parameters by setting pig.exec.reducers.bytes.per.reducer (the default is 1,000,000,000 bytes) and pig.exec.reducers.max (the default is 999).
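The default rule described above is simple arithmetic, and a small sketch can help when deciding whether to override it. The following Python function is an illustration only, not Pig's own code: the function name estimate_reducers is hypothetical, and rounding up to a whole reducer and never returning fewer than one are assumptions about how the rule behaves at the edges.

```python
# Defaults quoted in the text above.
DEFAULT_BYTES_PER_REDUCER = 1_000_000_000  # pig.exec.reducers.bytes.per.reducer
DEFAULT_MAX_REDUCERS = 999                 # pig.exec.reducers.max


def estimate_reducers(input_bytes,
                      bytes_per_reducer=DEFAULT_BYTES_PER_REDUCER,
                      max_reducers=DEFAULT_MAX_REDUCERS):
    """Approximate the default reducer count for a given input size.

    Hypothetical helper: one reducer per bytes_per_reducer of input,
    capped at max_reducers. Rounding up and the floor of one reducer
    are assumptions made for this sketch.
    """
    if input_bytes < 0:
        raise ValueError("input_bytes must be non-negative")
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Integer ceiling division avoids floating-point error on large sizes.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    # Clamp to the range [1, max_reducers].
    return max(1, min(max_reducers, wanted))
```

Under these assumptions, a 25 GB input (25,000,000,000 bytes) yields 25 reducers, while a 5 TB input hits the cap of 999. Halving the bytes-per-reducer setting doubles the estimate until the cap is reached.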

To explicitly set the number of reducers you want for each job, you can use a PARALLEL clause for operators that run in the reduce phase. These include all the grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and ORDER. The following line sets the number of reducers to 30 for the GROUP:

grouped_records = GROUP records BY year PARALLEL 30;

Alternatively, you can set the default_parallel option, and it will take effect for all subsequent jobs:

grunt> set default_parallel 30

See "Choosing the Number of Reducers" for further discussion.

The number of map tasks is set by the size of the input (with one map per HDFS block) and is not affected by the PARALLEL clause.

Anonymous Relations

You usually apply a diagnostic operator like DUMP or DESCRIBE to the most recently defined relation. Since this is so common, Pig has a shortcut to refer to the previous relation: @. Similarly, it can be tiresome to have to come up with a name for each relation when using the interpreter. Pig allows you to use the special syntax => to create a relation with no alias, which can only be referred to with @. For example:

grunt> => LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DUMP @
(1950,0,1)
(1950,22,1)
(1950,-11,1)
