This is a classic inner join, where each match between the two relations corresponds to a
row in the result. (It's actually an equijoin because the join predicate is equality.) The
result's fields are made up of all the fields of all the input relations.
You should use the general join operator when all the relations being joined are too large
to fit in memory. If one of the relations is small enough to fit in memory, you can use a
special type of join called a fragment replicate join, which is implemented by distributing
the small input to all the mappers and performing a map-side join using an in-memory
lookup table against the (fragmented) larger relation. There is a special syntax for telling
Pig to use a fragment replicate join: [104]
grunt> C = JOIN A BY $0, B BY $1 USING 'replicated';
The first relation must be the large one, followed by one or more small ones (all of which
must fit in memory).
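To make the mechanism concrete, here is a minimal single-process sketch in Python of the idea behind a fragment replicate join: build an in-memory lookup table from the small relation, then stream the large relation past it. This is not Pig's implementation. The function name replicated_join and the sample tuples are assumptions made for illustration, and in a real job each mapper would run the probe loop over only its own fragment of the large relation.

```python
from collections import defaultdict

def replicated_join(large, small, large_key, small_key):
    """Inner equijoin that probes an in-memory table built from `small`."""
    # Build phase: index the small relation by its join key.
    # In a real job this table would be copied to every mapper.
    lookup = defaultdict(list)
    for row in small:
        lookup[row[small_key]].append(row)

    # Probe phase: stream the large relation and emit one row per match.
    joined = []
    for row in large:
        for match in lookup.get(row[large_key], []):
            joined.append(row + match)
    return joined

# Hypothetical sample relations (tuples stand in for Pig tuples).
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

C = replicated_join(A, B, large_key=0, small_key=1)
for row in C:
    print(row)
```

Because only the small relation is held in memory, the large relation never needs to be shuffled or sorted, which is the reason the large relation must come first in the Pig statement.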
Pig also supports outer joins using a syntax that is similar to SQL's (this is covered for
Hive in Outer joins). For example:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Hank,2)
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
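The left outer join semantics can be sketched in Python as follows. This is an illustrative model rather than how Pig executes the join: the function name left_outer_join is hypothetical, the sample tuples are chosen to be consistent with the output shown above, and None stands in for the empty (null) fields that Pig prints as blanks. Row order within a key may differ from Pig's output, which does not guarantee ordering.

```python
def left_outer_join(left, right, left_key, right_key):
    """Keep every row of `left`; pad with None when `right` has no match."""
    # Number of fields to pad with when there is no match.
    right_width = len(right[0]) if right else 0

    # Index the right-hand relation by its join key.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    result = []
    for row in left:
        matches = index.get(row[left_key])
        if matches:
            result.extend(row + m for m in matches)
        else:
            # Unmatched left row: fill the right-hand fields with nulls.
            result.append(row + (None,) * right_width)
    return result

# Sample relations assumed for illustration.
A = [(1, "Scarf"), (2, "Tie"), (3, "Hat"), (4, "Coat")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

C = left_outer_join(A, B, left_key=0, right_key=1)
for row in C:
    print(row)
```

Note that the tuple with key 0 in B is dropped, because only unmatched rows from the left relation are preserved.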
COGROUP
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is similar to
JOIN, but instead creates a nested set of output tuples. This can be useful if you want to
exploit the structure in subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Hank,2),(Joe,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each tuple is
the key, and the remaining fields are bags of tuples from the relations with a matching key.
The first bag contains the matching tuples from relation A with the same key. Similarly,
the second bag contains the matching tuples from relation B with the same key.
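The nested structure described above can be modeled in a few lines of Python. This is a sketch of the semantics only, not Pig's implementation: the function name cogroup is hypothetical, Python lists stand in for Pig bags, and the sample tuples are taken from the bags in the DUMP D output. Keys are sorted here purely to make the printed result easy to compare.

```python
def cogroup(a, b, a_key, b_key):
    """Return one (key, bag_from_a, bag_from_b) tuple per distinct key."""
    groups = {}
    # Every key seen in either relation gets a pair of (initially empty) bags.
    for row in a:
        groups.setdefault(row[a_key], ([], []))[0].append(row)
    for row in b:
        groups.setdefault(row[b_key], ([], []))[1].append(row)
    return [(key, bag_a, bag_b)
            for key, (bag_a, bag_b) in sorted(groups.items())]

# Sample relations reconstructed from the bags shown in the output above.
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

D = cogroup(A, B, a_key=0, b_key=1)
for row in D:
    print(row)
```

A key that appears in only one relation still produces a tuple, with an empty bag for the other relation, which is why keys 0 and 1 both appear in the result.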