Pig in Practice
A few practical techniques are worth knowing when you are developing and running Pig
programs. This section covers some of them.
Parallelism
When running in MapReduce mode, it's important that the degree of parallelism matches
the size of the dataset. By default, Pig sets the number of reducers by looking at the size of
the input and using one reducer per 1 GB of input, up to a maximum of 999 reducers. You
can override these parameters by setting pig.exec.reducers.bytes.per.reducer (the
default is 1,000,000,000 bytes) and pig.exec.reducers.max (the default is 999).
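The default heuristic described above can be worked through as simple arithmetic. The following Python sketch is only an illustration and is not Pig's own code: the function name estimate_reducers is hypothetical, and rounding up to a whole reducer with a floor of one reducer is an assumption about how the two limits combine.

```python
# Defaults quoted in the text above.
BYTES_PER_REDUCER = 1_000_000_000  # pig.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 999                 # pig.exec.reducers.max


def estimate_reducers(input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Hypothetical helper: size-based reducer estimate.

    Assumes one reducer per `bytes_per_reducer` of input, rounded up,
    never fewer than 1 and never more than `max_reducers`.
    """
    if input_bytes < 0:
        raise ValueError("input_bytes must not be negative")
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("limits must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return min(max_reducers, max(1, wanted))
```

Under these assumptions, a 2.5 GB input gives 3 reducers, and a 5 TB input is capped at 999.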
To explicitly set the number of reducers you want for each job, you can use a PARALLEL
clause for operators that run in the reduce phase. These include all the grouping and joining
operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and ORDER. The
following line sets the number of reducers to 30 for the GROUP:
grouped_records = GROUP records BY year PARALLEL 30;
Alternatively, you can set the default_parallel option, and it will take effect for all
subsequent jobs:
grunt> set default_parallel 30
See Choosing the Number of Reducers for further discussion.
The number of map tasks is set by the size of the input (with one map per HDFS block) and
is not affected by the PARALLEL clause.
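The way these settings interact can be summarized in a short Python sketch. It is not Pig's implementation: plan_tasks and its parameters are hypothetical names, the order of precedence (a PARALLEL clause first, then default_parallel, then the size-based estimate) is stated here as an assumption, and the map count is simplified to a single input file split on block boundaries.

```python
def plan_tasks(input_bytes, block_bytes, size_estimate,
               parallel_clause=None, default_parallel=None):
    """Hypothetical helper: return (map_tasks, reduce_tasks) for one job.

    size_estimate   -- reducer count Pig would pick from the input size
    parallel_clause -- value of a PARALLEL clause on the operator, if any
    default_parallel -- value of `set default_parallel`, if any
    """
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    # One map per HDFS block, rounded up (simplified: a single input file).
    map_tasks = (input_bytes + block_bytes - 1) // block_bytes
    # Assumed precedence: the most specific setting wins.
    if parallel_clause is not None:
        reduce_tasks = parallel_clause
    elif default_parallel is not None:
        reduce_tasks = default_parallel
    else:
        reduce_tasks = size_estimate
    return map_tasks, reduce_tasks
```

Note that the map count depends only on the input size and block size; none of the reducer settings change it.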
Anonymous Relations
You usually apply a diagnostic operator like DUMP or DESCRIBE to the most recently
defined relation. Since this is so common, Pig has a shortcut to refer to the previous
relation: @. Similarly, it can be tiresome to have to come up with a name for each relation
when using the interpreter. Pig allows you to use the special syntax => to create a relation
with no alias, which can only be referred to with @. For example:
grunt> => LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DUMP @
(1950,0,1)
(1950,22,1)
(1950,-11,1)