diate pairs, [105] rather than the one trillion (10^12) produced by the naive approach (generating
the cross product of the input) or the approach with no stopword removal.
GROUP
Where COGROUP groups the data in two or more relations, the GROUP statement groups
the data in a single relation. GROUP supports grouping by more than equality of keys: you
can use an expression or user-defined function as the group key. For example, consider the
following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
Let's group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Eve,apple),(Ali,apple)})
(6,{(Joe,banana),(Joe,cherry)})
GROUP creates a relation whose first field is the grouping field, which is given the alias
group. The second field is a bag containing the grouped tuples, which have the same schema as
the original relation (in this case, A).
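As an illustration of the result shape described above, the following Python sketch models what GROUP ... BY does with an arbitrary key expression. It is not Pig's implementation: the helper name group_by is hypothetical, and the output is sorted only to make it deterministic, whereas Pig makes no ordering guarantee.

```python
from collections import defaultdict

# Relation A from the example: a list of (name, fruit) tuples.
A = [("Joe", "cherry"), ("Ali", "apple"), ("Joe", "banana"), ("Eve", "apple")]

def group_by(relation, key_fn):
    """Hypothetical model of GROUP ... BY: one (group, bag) pair per distinct key."""
    bags = defaultdict(list)
    for tup in relation:
        # The whole tuple goes into the bag, so the bag keeps the
        # schema of the original relation.
        bags[key_fn(tup)].append(tup)
    # Sorting by key is an assumption for readability; relations are unordered.
    return sorted(bags.items())

# Equivalent in spirit to: B = GROUP A BY SIZE($1);
B = group_by(A, lambda t: len(t[1]))
for group, bag in B:
    print(group, bag)
```

Each element of B pairs the computed key (the "group" field) with a bag of the complete input tuples that produced that key.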
There are also two special grouping operations: ALL and ANY. ALL groups all the tuples
in a relation in a single group, as if the grouping key were a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Eve,apple),(Joe,banana),(Ali,apple),(Joe,cherry)})
Note that there is no BY in this form of the GROUP statement. The ALL grouping is commonly
used to count the number of tuples in a relation, as shown in Validation and nulls.
The ANY keyword is used to group the tuples in a relation randomly, which can be useful
for sampling.
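The counting use of ALL can be pictured with a short Python sketch. It is a model of the semantics only, under the assumption that the relation fits in a list; the helper name group_all is hypothetical.

```python
# Relation A from the example: a list of (name, fruit) tuples.
A = [("Joe", "cherry"), ("Ali", "apple"), ("Joe", "banana"), ("Eve", "apple")]

def group_all(relation):
    """Hypothetical model of GROUP ... ALL: a single group keyed by the constant 'all'."""
    return [("all", list(relation))]

# Equivalent in spirit to: C = GROUP A ALL;
C = group_all(A)

# Counting the relation is then just the size of the single bag.
group, bag = C[0]
count = len(bag)
print(group, count)
```

In Pig itself, the count would typically be taken by applying COUNT to the bag in a FOREACH ... GENERATE over the grouped relation.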
Sorting Data
Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)