Category    Function                        Description
            ParquetLoader, ParquetStorer    Loads or stores relations from or to Parquet files.
            OrcStorage                      Loads or stores relations from or to Hive ORCFiles.
            HBaseStorage                    Loads or stores relations from or to HBase tables.
[a] The default storage can be changed by setting pig.default.load.func and pig.default.store.func to the
fully qualified load and store function classnames.
Other libraries
If the function you need is not available, you can write your own user-defined function (or
UDF for short), as explained in User-Defined Functions. Before you do that, however,
have a look in the Piggy Bank, a library of Pig functions shared by the Pig community and
distributed as a part of Pig. For example, there are load and store functions in the Piggy
Bank for CSV files, Hive RCFiles, sequence files, and XML files. The Piggy Bank JAR
file comes with Pig, and you can use it with no further configuration. Pig's API
documentation includes a list of functions provided by the Piggy Bank.
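As a minimal sketch of using a Piggy Bank load function, the following reads a CSV file with the CSVLoader class from the org.apache.pig.piggybank.storage package. The input path and the field names are hypothetical, and the commented-out REGISTER line (with its placeholder path) is an assumption for installations where the Piggy Bank JAR is not already on Pig's classpath.

```pig
-- Only needed if piggybank.jar is not already available (path is a placeholder)
-- REGISTER /path/to/piggybank.jar;

-- Hypothetical input file and schema, for illustration only
stations = LOAD 'input/stations.csv'
    USING org.apache.pig.piggybank.storage.CSVLoader()
    AS (id:chararray, name:chararray, elevation:int);

DUMP stations;
```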
Apache DataFu is another rich library of Pig UDFs. In addition to general utility
functions, it includes functions for computing basic statistics, performing sampling and
estimation, hashing, and working with web data (sessionization, link analysis).
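As a rough sketch of what calling a DataFu statistics function might look like, the following estimates a per-group median. It assumes that the DataFu JAR has been downloaded separately (the path is a placeholder) and that the datafu.pig.stats.StreamingMedian class is present in your DataFu version; the input path and field names are hypothetical.

```pig
-- Path to the DataFu JAR is a placeholder; adjust for your installation
REGISTER /path/to/datafu-pig.jar;

-- Give the UDF a short alias (class name assumed to exist in your DataFu version)
DEFINE Median datafu.pig.stats.StreamingMedian();

-- Hypothetical input file and schema, for illustration only
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);
grouped = GROUP records BY year;

-- One output row per year, with an estimated median temperature
median_temp = FOREACH grouped GENERATE group, Median(records.temperature);

DUMP median_temp;
```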
Macros
Macros provide a way to package reusable pieces of Pig Latin code from within Pig Latin
itself. For example, we can extract the part of our Pig Latin program that performs
grouping on a relation and then finds the maximum value in each group by defining a macro as
follows:
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
  A = GROUP $X BY $group_key;
  $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
The macro, called max_by_group, takes three parameters: a relation, X, and two field
names, group_key and max_field. It returns a single relation, Y. Within the macro
body, parameters and return aliases are referenced with a $ prefix, such as $X.
The macro is used as follows:
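A minimal sketch of such a call, assuming the macro is defined earlier in the same script; the input path, the relation names, and the schema are hypothetical and chosen only for illustration:

```pig
-- Hypothetical tab-delimited input with a year and a temperature per record
records = LOAD 'input/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

-- Pass the relation and the two field names; the result is bound to max_temp
max_temp = max_by_group(records, year, temperature);

DUMP max_temp;
```

When the script runs, Pig expands the macro in place, substituting records for $X, year for $group_key, temperature for $max_field, and max_temp for $Y. Aliases that are local to the macro body, such as A, are renamed during expansion so that they do not clash with aliases elsewhere in the script. Running Pig with the -dryrun option is one way to inspect the expanded form.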