ing the cross product of the input) or the approach with no stopword removal.
GROUP
Where COGROUP groups the data in two or more relations, the GROUP statement groups the data in a single relation. GROUP supports grouping by more than equality of keys: you can use an expression or user-defined function as the group key. For example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
Let's group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Eve,apple),(Ali,apple)})
(6,{(Joe,banana),(Joe,cherry)})
GROUP creates a relation whose first field is the grouping field, which is given the alias group. The second field is a bag containing the grouped fields with the same schema as the original relation (in this case, A).
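Both fields can be referenced by name in a later statement. The following is a minimal sketch, assuming the relation B from the example above; the alias counts and the field names len and n are hypothetical names introduced here for illustration:

```pig
-- Assumes B = GROUP A BY SIZE($1); as defined in the example above.
-- 'group' is the grouping key; 'A' is the bag of tuples that share that key.
counts = FOREACH B GENERATE group AS len, COUNT(A) AS n;
DUMP counts;
-- For the four tuples shown above this should yield (5,2) and (6,2);
-- the order of the output tuples is not guaranteed.
```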
There are also two special grouping operations: ALL and ANY. ALL groups all the tuples in a relation into a single group, as if the group key were a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Eve,apple),(Joe,banana),(Ali,apple),(Joe,cherry)})
Note that there is no BY in this form of the GROUP statement. The ALL grouping is commonly used to count the number of tuples in a relation, as shown in Validation and nulls.
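As a minimal sketch of that counting idiom, assuming the relation A shown earlier in this section (the alias total is a hypothetical name introduced here):

```pig
-- Put every tuple of A into one group, then count the bag.
C = GROUP A ALL;
total = FOREACH C GENERATE COUNT(A);
DUMP total;
-- For the four-tuple relation A above this should yield (4).
-- Note that COUNT skips tuples whose first field is null.
```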
The ANY keyword is used to group the tuples in a relation randomly, which can be useful for sampling.
Sorting Data
Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)