(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)
grunt> SPLIT c INTO d IF $0 == 0, e IF $0 == 1;
grunt> DUMP d;
(0,1,2)
(0,5,2)
grunt> DUMP e;
(1,3,4)
(1,7,8)
The UNION operator allows duplicates. You can use the DISTINCT operator to remove
duplicates from a relation. Our SPLIT operation on c sends a tuple to d if its first field
($0) is 0, and to e if it's 1. It's possible to write conditions such that some rows will go
to both d and e or to neither. You can simulate SPLIT by multiple FILTER operators. The
FILTER operator alone trims a relation down to only tuples that pass a certain test:
grunt> f = FILTER c BY $1 > 3;
grunt> DUMP f;
(0,5,2)
(1,7,8)
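The remarks above about simulating SPLIT with FILTER, and about conditions that overlap or leave gaps, can be sketched as follows. This is only an illustration against the same relation c; the aliases d2, e2, x, y, and u are hypothetical names introduced here, not part of the original session.

```pig
-- Two FILTERs that select the same tuples as the SPLIT into d and e.
-- d2 and e2 are hypothetical aliases, used so that d and e are not overwritten.
grunt> d2 = FILTER c BY $0 == 0;
grunt> e2 = FILTER c BY $0 == 1;

-- SPLIT conditions need not be mutually exclusive or exhaustive.
-- x and y are hypothetical aliases.
grunt> SPLIT c INTO x IF $1 > 4, y IF $2 > 3;

-- DISTINCT drops duplicate tuples; u is a hypothetical alias.
grunt> u = DISTINCT c;
```

With the four tuples of c shown earlier, (0,1,2) satisfies neither condition of the second SPLIT, (0,5,2) goes only to x, (1,3,4) goes only to y, and (1,7,8) goes to both. Since c as shown contains no duplicate tuples, u should hold the same four tuples as c.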
We've seen LIMIT being used to take a specified number of tuples from a relation. SAMPLE
is an operator that randomly samples tuples in a relation according to a specified
percentage, written as a fraction between 0 and 1.
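As a minimal sketch of the two operators on the relation c, assuming the same session as before (the aliases s and t are hypothetical names chosen for illustration):

```pig
-- Keep each tuple of c with probability 0.5, i.e. roughly 50% of the tuples.
grunt> s = SAMPLE c 0.5;

-- Keep at most two tuples of c.
grunt> t = LIMIT c 2;
```

No output is shown because SAMPLE is random, so the contents of s can differ from run to run and need not be exactly half of c. Likewise, without a preceding ORDER, which two tuples LIMIT returns is not guaranteed.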
The operations so far are relatively simple in the sense that they operate on each
tuple as an atomic unit. More complex data processing, on the other hand, will require
working on groups of tuples together. We'll next look at operators for grouping. Unlike
previous operators, these grouping operators will create new schemas in their output
that rely heavily on bags and nested data types. The generated schema may take a little
time to get used to at first. Keep in mind that these grouping operators are almost
always for generating intermediate data. Their complexity is only temporary on your
way to computing the final results.
The simpler of these operators is GROUP. Continuing with the same set of relations
we used earlier:
grunt> g = GROUP c BY $2;
grunt> DUMP g;
(2,{(0,1,2),(0,5,2)})
(4,{(1,3,4)})
(8,{(1,7,8)})
grunt> DESCRIBE c;
c: {a1: int,a2: int,a3: int}
grunt> DESCRIBE g;
g: {group: int,c: {a1: int,a2: int,a3: int}}
We've created a new relation, g, by grouping tuples in c that have the same value in
the third column ($2, also named a3). The output of GROUP always has two fields. The
first field is the group key, which is a3 in this case. The second field is a bag containing