How are schemas propagated to new relations? Some relational operators don't change the schema, so the relation produced by the LIMIT operator (which restricts a relation to a maximum number of tuples), for example, has the same schema as the relation it operates on. For other operators, the situation is more complicated. UNION, for example, combines two or more relations into one and tries to merge the input relations' schemas. If the schemas are incompatible, due to different types or number of fields, then the schema of the result of the UNION is unknown.
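As a rough sketch of the two cases described above, the following Pig Latin script loads two relations with deliberately mismatched schemas. The file paths and field names are hypothetical and chosen only for illustration.

```pig
-- Hypothetical input files and field names, for illustration only.
A = LOAD 'input/A' AS (f0:int, f1:int);
B = LOAD 'input/B' AS (f0:chararray, f1:chararray, f2:int);

-- LIMIT does not change the schema, so C has the same fields as A.
C = LIMIT A 2;
DESCRIBE C;

-- The schemas of A and B differ in both field types and field count,
-- so Pig cannot merge them and the schema of D is unknown.
D = UNION A, B;
DESCRIBE D;
```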
You can find out the schema for any relation in the data flow using the DESCRIBE operator. If you want to redefine the schema for a relation, you can use the FOREACH...GENERATE operator with AS clauses to define the schema for some or all of the fields of the input relation.
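A minimal sketch of this pattern, assuming a hypothetical input file loaded without any schema (so its fields can only be referenced by position):

```pig
-- Hypothetical input file; no schema is declared at load time.
records = LOAD 'input/sample.txt';
DESCRIBE records;

-- Use AS clauses (with casts) to give the first two fields names and types.
-- The names 'year' and 'temperature' are assumptions for this example.
named = FOREACH records GENERATE (chararray)$0 AS year, (int)$1 AS temperature;
DESCRIBE named;
```

Fields that are not listed in the GENERATE clause are dropped from the new relation, so include every field you want to keep.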
See User-Defined Functions for a further discussion of schemas.
Functions
Functions in Pig come in four types:
Eval function
A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function. Furthermore, many aggregate functions are algebraic, which means that the result of the function may be calculated incrementally. In MapReduce terms, algebraic functions make use of the combiner and are much more efficient to calculate (see Combiner). A function that calculates the median of a collection of values is an example of a function that is not algebraic.
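To illustrate an aggregate eval function in use, here is a short sketch that applies MAX to the bag produced by a GROUP. The input path and field names are hypothetical.

```pig
-- Hypothetical input file and field names.
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);

-- GROUP collects the tuples for each year into a bag named 'records'.
grouped = GROUP records BY year;

-- MAX reduces each bag to a single scalar value. The built-in MAX is
-- algebraic, so Pig can compute partial maxima in the combiner.
max_temp = FOREACH grouped GENERATE group, MAX(records.temperature);
DUMP max_temp;
```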
Filter function
A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in the FILTER operator to remove unwanted rows. They can also be used in other relational operators that take Boolean conditions and, in general, in any Boolean or conditional expression. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
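One common place a filter function like IsEmpty shows up is after a COGROUP, where some groups may have an empty bag for one of the inputs. The sketch below assumes two hypothetical input files that share an id field.

```pig
-- Hypothetical input files and field names.
A = LOAD 'input/A' AS (id:int, name:chararray);
B = LOAD 'input/B' AS (name:chararray, id:int);

-- Each tuple in C has the fields (group, A, B), where A and B are bags
-- holding the tuples from each input that share the same id.
C = COGROUP A BY id, B BY id;

-- Keep only the groups for which both bags contain at least one tuple.
D = FILTER C BY (NOT IsEmpty(A)) AND (NOT IsEmpty(B));
DUMP D;
```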
Load function
A function that specifies how to load data into a relation from external storage.