    records = LOAD 'input/ncdc/micro-tab/sample.txt'
      AS (year:chararray, temperature:int, quality:int);
    filtered_records = FILTER records BY temperature != 9999 AND
      quality IN (0, 1, 4, 5, 9);
    max_temp = max_by_group(filtered_records, year, temperature);
    DUMP max_temp
At runtime, Pig will expand the macro using the macro definition. After expansion, the
program looks like the following, where the macro call has been replaced by the
statements that define macro_max_by_group_A_0 and max_temp:
    records = LOAD 'input/ncdc/micro-tab/sample.txt'
      AS (year:chararray, temperature:int, quality:int);
    filtered_records = FILTER records BY temperature != 9999 AND
      quality IN (0, 1, 4, 5, 9);
    macro_max_by_group_A_0 = GROUP filtered_records by (year);
    max_temp = FOREACH macro_max_by_group_A_0 GENERATE group,
      MAX(filtered_records.(temperature));
    DUMP max_temp
Normally you don't see the expanded form, because Pig creates it internally; however, in
some cases it is useful to see it when writing and debugging macros. You can get Pig to
perform macro expansion only (without executing the script) by passing the -dryrun
argument to pig.
Notice that the parameters that were passed to the macro (filtered_records, year, and
temperature) have been substituted for the names in the macro definition. Aliases
in the macro definition that don't have a $ prefix, such as A in this example, are local to
the macro definition and are rewritten at expansion time to avoid conflicts with aliases in
other parts of the program. In this case, A becomes macro_max_by_group_A_0 in the
expanded form.
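For reference, a macro definition consistent with the expansion shown above might look like the following sketch. Only the macro name max_by_group and the local alias A come from the text; the parameter names X, group_key, and max_field and the return alias Y are assumptions chosen for illustration.

```pig
-- Hypothetical definition: the parameter names are illustrative assumptions.
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
  -- A has no $ prefix, so it is local to the macro and is renamed on expansion.
  A = GROUP $X BY $group_key;
  -- $Y is replaced by the alias on the left side of the call (max_temp above).
  $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
```

With this definition, the call max_by_group(filtered_records, year, temperature) binds $X to filtered_records, $group_key to year, and $max_field to temperature, which matches the substitutions visible in the expanded form.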
To foster reuse, macros can be defined in files separate from the Pig scripts that use
them, in which case they need to be imported into each such script. An import statement
looks like this:
    IMPORT './ch16-pig/src/main/pig/max_temp.macro';
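As a sketch of how the import fits into a complete script, the example from this section could be written as follows. It assumes that the imported file contains the definition of max_by_group and that the relative path resolves from the directory where pig is run; neither is confirmed by the text.

```pig
-- Assumption: max_temp.macro defines the max_by_group macro used below.
IMPORT './ch16-pig/src/main/pig/max_temp.macro';

records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  quality IN (0, 1, 4, 5, 9);
-- The macro is called exactly as if it had been defined in this script.
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;
```

Keeping the definition in its own file means that several scripts can share it without copying the DEFINE block into each one.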