Database Reference
In-Depth Information
The Core Crunch API
This section presents the core interfaces in Crunch. Crunch's API is high level by design,
so the programmer can concentrate on the logical operations of the computation, rather
than the details of how it is executed.
Primitive Operations
The core data structure in Crunch is
PCollection<S>
, an immutable, unordered, dis-
tributed collection of elements of type
S
. In this section, we examine the primitive opera-
tions on
PCollection
and its derived types,
PTable
and
PGroupedTable
.
union()
The simplest primitive Crunch operation is
union()
, which returns a
PCollection
that contains all the elements of the
PCollection
it is invoked on and the
PCollec-
tion
supplied as an argument. For example:
PCollection
<
Integer
>
a
=
MemPipeline
.
collectionOf
(
1
,
3
);
PCollection
<
Integer
>
b
=
MemPipeline
.
collectionOf
(
2
);
PCollection
<
Integer
>
c
=
a
.
union
(
b
);
assertEquals
(
"{2,1,3}"
,
dump
(
c
));
MemPipeline
's
collectionOf()
method is used to create a
PCollection
in-
stance from a small number of elements, normally for the purposes of testing or demonstra-
tion. The
dump()
method is a utility method introduced here for rendering the contents of
a small
PCollection
as a string (it's not a part of Crunch, but you can find the imple-
mentation in the
PCollections
class in the example code that accompanies this topic).
Since
PCollection
s are unordered, the order of the elements in
c
is undefined.
When forming the union of two
PCollection
s, they must have been created from the
same pipeline (or the operation will fail at runtime), and they must have the same type. The
latter condition is enforced at compile time, since
PCollection
is a parameterized type
and the type arguments for the
PCollection
s in the union must match.
parallelDo()
The second primitive operation is
parallelDo()
for calling a function on every element
in an input
PCollection<S>
and returning a new output
PCollection<T>
contain-
ing the results of the function calls. In its simplest form,
parallelDo()
takes two argu-
ments: a
DoFn<S, T>
implementation that defines a function transforming elements of