FOREACH...GENERATE statements, and adds them to the logical plan without executing them. The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.
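The idea of recording statements in a plan and running them only when DUMP is reached can be illustrated with a toy Python sketch. This is not how Pig is implemented: the class LazyRelation, its methods, and the sample records are hypothetical, and a plain list stands in for a loaded file. It only shows that building the plan does no work until the dump step is called.

```python
# Toy model of deferred execution (an assumption-laden sketch, not Pig itself):
# each statement is appended to a plan, and only dump() runs the plan.

class LazyRelation:
    def __init__(self, source, steps=()):
        self.source = source        # list standing in for a loaded file
        self.steps = tuple(steps)   # recorded operations, a stand-in for the logical plan

    def filter(self, predicate):
        # Record the step and return a new relation; nothing is executed here.
        return LazyRelation(self.source, self.steps + (("filter", predicate),))

    def foreach(self, func):
        # Record a per-row transformation, again without executing it.
        return LazyRelation(self.source, self.steps + (("foreach", func),))

    def dump(self):
        # The execution trigger: apply the recorded steps in order.
        rows = iter(self.source)
        for kind, fn in self.steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


calls = []  # records every row the transformation actually touches

def year_of(record):
    calls.append(record)
    return record[0]

# Hypothetical sample data: (year, temperature) tuples.
records = [(1950, 0), (1950, 22), (1949, -5)]

plan = LazyRelation(records).filter(lambda r: r[1] >= 0).foreach(year_of)
calls_before_dump = list(calls)   # still empty: the plan has only been built
result = plan.dump()              # execution happens here
print(calls_before_dump, result)
```

Each method returns a new object rather than mutating the old one, so a relation can be reused as the starting point for several later statements, much as a Pig relation can feed more than one downstream operator.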
MULTIQUERY EXECUTION
Because DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is different. In interactive mode, STORE acts like DUMP and will always trigger execution (this includes the run command), but in batch mode it will not (this includes the exec command). The reason for this is efficiency. In batch mode, Pig will parse the whole script to see whether there are any optimizations that could be made to limit the amount of data to be written to or read from disk. Consider the following simple example:
    A = LOAD 'input/pig/multiquery/A';
    B = FILTER A BY $1 == 'banana';
    C = FILTER A BY $1 != 'banana';
    STORE B INTO 'output/b';
    STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.
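The saving from a single scan can be sketched outside Pig with a toy Python model. This is an illustration under stated assumptions, not Pig's implementation: the function run_multiquery, the sample rows, and the in-memory lists standing in for input and output files are all hypothetical. It shows one pass over the input feeding both filters, analogous to reading A once and producing B and C.

```python
# Toy model of multiquery execution (illustrative only, not Pig's code):
# every row is read once and tested against all registered filters.

def run_multiquery(rows, filters):
    """Scan rows once; route each row to every output whose filter accepts it."""
    outputs = {name: [] for name in filters}
    rows_read = 0
    for row in rows:
        rows_read += 1                      # one read per input row, however many outputs
        for name, keep in filters.items():
            if keep(row):
                outputs[name].append(row)
    return outputs, rows_read


# Hypothetical sample data: (id, fruit) tuples, so row[1] plays the role of $1.
A = [(1, "banana"), (2, "apple"), (3, "banana"), (4, "cherry")]

outputs, rows_read = run_multiquery(
    A,
    {
        "B": lambda row: row[1] == "banana",   # FILTER A BY $1 == 'banana'
        "C": lambda row: row[1] != "banana",   # FILTER A BY $1 != 'banana'
    },
)
print(outputs["B"])
print(outputs["C"])
print(rows_read)
```

Running the two filters as separate passes would read every row of A twice; here the read count equals the number of input rows regardless of how many outputs are produced.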
In previous versions of Pig that did not have multiquery execution, each STORE statement in a script run in batch mode triggered execution, resulting in a job for each STORE statement. It is possible to restore the old behavior by disabling multiquery execution with the -M or -no_multiquery option to pig.
The physical plan that Pig prepares is a series of MapReduce jobs. In local mode Pig runs these jobs in the local JVM, and in MapReduce mode it runs them on a Hadoop cluster.
NOTE
You can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp;, for example). EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs. This is a good way to find out how many MapReduce jobs Pig will run for your query.
The relational operators that can be a part of a logical plan in Pig are summarized in Table 16-1. We go through the operators in more detail in Data Processing Operators.