Using LIMIT can improve the performance of a query because Pig tries to apply the limit
as early as possible in the processing pipeline, to minimize the amount of data that needs
to be processed. For this reason, you should always use LIMIT if you are not interested in
the entire output.
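As a minimal sketch of the point above, the script below applies LIMIT on its own and then after ORDER. The input path 'input/A.tsv' is hypothetical; it is assumed to be a tab-separated file of two-field integer tuples. No output is shown for the first DUMP because a plain LIMIT may return any two tuples.

```pig
-- Hypothetical input path; assumed to contain tab-separated pairs of integers.
A = LOAD 'input/A.tsv' AS (f0:int, f1:int);

-- Keep at most two tuples. Which two is undefined, because relations are unordered.
D = LIMIT A 2;
DUMP D;

-- ORDER followed by LIMIT gives a deterministic "top N" result.
S = ORDER A BY f1 DESC;
T = LIMIT S 1;
DUMP T;
```

Using LIMIT directly after ORDER is the usual way to get a predictable result, since the limit is then taken from a sorted relation.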
Combining and Splitting Data
Sometimes you have several relations that you would like to combine into one. For this,
the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(z,x,8)
(1,2)
(w,y,1)
(2,4)
C is the union of relations A and B, and because relations are unordered, the order of the
tuples in C is undefined. Also, it's possible to form the union of two relations with different
schemas or with different numbers of fields, as we have done here. Pig attempts to
merge the schemas from the relations that UNION is operating on. In this case, they are
incompatible, so C has no schema:
grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.
If the output relation has no schema, your script needs to be able to handle tuples that vary
in the number of fields and/or types.
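For contrast, here is a sketch of a union whose inputs share the same schema, in which case Pig should be able to carry the schema through to the result. Both input paths and the alias A2 are hypothetical and stand in for any two files of two-field integer tuples.

```pig
-- Hypothetical input paths; both files are assumed to hold tab-separated pairs of integers.
A  = LOAD 'input/A.tsv'  AS (f0:int, f1:int);
A2 = LOAD 'input/A2.tsv' AS (f0:int, f1:int);

-- The inputs have identical field names and types, so the schemas are compatible.
C2 = UNION A, A2;

-- Unlike C in the example above, C2 is expected to report a schema here.
DESCRIBE C2;
```

Because C2 keeps its field names, later statements can refer to f0 and f1 by name rather than by position.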
The SPLIT operator is the opposite of UNION: it partitions a relation into two or more
relations. See Validation and nulls for an example of how to use it.
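As a brief sketch of the syntax, the script below splits relation A from the UNION example on the value of its first field. The input path and the output aliases X and Y are hypothetical.

```pig
-- Hypothetical input path; assumed to hold the tuples (2,3), (1,2), (2,4) from relation A.
A = LOAD 'input/A.tsv' AS (f0:int, f1:int);

-- Each tuple is tested against every condition and sent to each relation it matches.
SPLIT A INTO X IF f0 == 1, Y IF f0 == 2;

DUMP X;   -- the single tuple whose first field is 1
DUMP Y;   -- the two tuples whose first field is 2, in undefined order
```

The conditions need not be exhaustive or mutually exclusive: a tuple that matches no condition is dropped, and a tuple that matches several conditions appears in several output relations.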