the AS option after each field. This syntax differs from the LOAD command
where the schema is specified as a list after the AS option, but in both cases we
use AS to specify a schema.
Table 10.8 summarizes the relational operators in Pig Latin. On many operators you'll
see an option for PARALLEL n. The number n is the degree of parallelism you want
for executing that operator. In practice n is the number of reduce tasks in Hadoop
that Pig will use. If you don't set n, it defaults to the default setting of your Hadoop
cluster. Pig documentation recommends setting the value of n according to the following
guideline:
n = (#nodes - 1) * 0.45 * RAM
where #nodes is the number of nodes and RAM is the amount of memory in GB on
each node.
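As a quick worked example of the guideline above, here is a small Python sketch. The function name, the rounding down to a whole number, and the floor of one reduce task are assumptions made for illustration; the guideline itself only gives the product.

```python
def recommended_parallel(num_nodes: int, ram_gb_per_node: float) -> int:
    """Apply the guideline n = (#nodes - 1) * 0.45 * RAM.

    Rounding down and never returning less than 1 are assumptions,
    not part of the guideline.
    """
    n = (num_nodes - 1) * 0.45 * ram_gb_per_node
    return max(1, int(n))


# Hypothetical cluster: 10 nodes with 8 GB of RAM each.
# (10 - 1) * 0.45 * 8 = 32.4, which rounds down to 32.
n = recommended_parallel(10, 8)
print(n)
```

The result would then be used as the value in PARALLEL n on operators that accept it.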
Table 10.8 Relational operators in Pig Latin
SPLIT
SPLIT alias INTO alias IF expression, alias IF
expression [, alias IF expression ...];
Splits a relation into two or more relations, based on the given Boolean
expressions. Note that a tuple can be assigned to more than one relation, or to
none at all.
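The note that a tuple can land in several output relations, or in none, can be illustrated with a small Python model of SPLIT. This is only a sketch of the semantics, not how Pig implements the operator; the helper name split and the sample data are hypothetical.

```python
def split(relation, *conditions):
    """Model of SPLIT: return one output list per condition.

    Every tuple is tested against every condition independently,
    so it may appear in several outputs or in none.
    """
    return [[t for t in relation if cond(t)] for cond in conditions]


data = [(1,), (5,), (10,), (15,)]

# Two overlapping conditions that do not cover every tuple.
small, mid = split(
    data,
    lambda t: t[0] < 10,
    lambda t: 3 < t[0] < 12,
)

print(small)  # (5,) appears here ...
print(mid)    # ... and here; (15,) appears in neither
```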
UNION
alias = UNION alias, alias [, alias ...];
Creates the union of two or more relations. Note that, as with any relation,
there's no guarantee to the order of tuples; the relations aren't required to
have the same schema or even the same number of fields; and duplicate tuples
aren't removed.
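The UNION behavior described above can be modeled in a few lines of Python. This is a sketch of the semantics only; the helper name union and the sample tuples are hypothetical. A Python list keeps insertion order, whereas Pig makes no ordering promise, so the comparison below ignores order.

```python
from collections import Counter


def union(*relations):
    """Model of UNION: concatenate the inputs without removing duplicates
    and without checking that the tuples share a schema."""
    out = []
    for relation in relations:
        out.extend(relation)
    return out


a = [(1, "x"), (2, "y")]
b = [(2, "y"), (3,)]  # a different number of fields is allowed

c = union(a, b)

# Compare as a multiset because tuple order is not guaranteed in Pig.
counts = Counter(c)
print(counts[(2, "y")])  # the duplicate tuple is kept
```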
FILTER
alias = FILTER alias BY expression;
Selects tuples based on a Boolean expression. Used to select tuples that you
want or remove tuples that you don't want.
DISTINCT
alias = DISTINCT alias [PARALLEL n];
Removes duplicate tuples.
SAMPLE
alias = SAMPLE alias factor;
Randomly samples a relation. The sampling factor is given in factor. For
example, a 1% sample of data in relation large_data is
small_data = SAMPLE large_data 0.01;
The operation is probabilistic, so the size of small_data will generally not be
exactly 1% of large_data, and there's no guarantee the operation will return the
same number of tuples each time.
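One way to see why the sample size varies is to model SAMPLE as keeping each tuple independently with probability factor. The Python sketch below assumes that model; the helper name sample, the seeded generator, and the data are hypothetical and chosen only for illustration.

```python
import random


def sample(relation, factor, rng):
    """Keep each tuple independently with probability `factor`.

    The result size is therefore random: close to factor * len(relation)
    on average, but not fixed.
    """
    return [t for t in relation if rng.random() < factor]


large_data = [(i,) for i in range(10_000)]
rng = random.Random(0)  # seeded only so that a run can be repeated

run1 = sample(large_data, 0.01, rng)
run2 = sample(large_data, 0.01, rng)

# The two sizes are usually near 100 but need not be equal to it or to each other.
print(len(run1), len(run2))
```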
FOREACH
alias = FOREACH alias GENERATE expression [,expression
...] [AS schema];
Loops through each tuple and generates new tuple(s). Usually applied to transform
columns of data, such as adding or deleting fields.
You can optionally specify a schema for the output relation, for example to
name new fields.
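The per-tuple transformation that FOREACH ... GENERATE performs can be sketched in Python as follows. This models the semantics only; the helper name foreach_generate, the schema keyword, and the orders data are hypothetical. Field names are attached here by building dictionaries, which is just one way to represent a named schema.

```python
def foreach_generate(relation, *expressions, schema=None):
    """Model of FOREACH ... GENERATE: evaluate each expression on each tuple.

    If `schema` (a sequence of field names) is given, the generated fields
    are returned as name -> value dictionaries instead of plain tuples.
    """
    rows = [tuple(expr(t) for expr in expressions) for t in relation]
    if schema is None:
        return rows
    return [dict(zip(schema, row)) for row in rows]


# Hypothetical relation: (item, quantity, unit price in cents).
orders = [("apple", 2, 5), ("pear", 3, 2)]

# Keep the first field, drop the other two, and add a computed field.
totals = foreach_generate(
    orders,
    lambda t: t[0],
    lambda t: t[1] * t[2],
    schema=("item", "total"),
)
print(totals)
```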
 