The Core Crunch API
This section presents the core interfaces in Crunch. Crunch's API is high level by design,
so the programmer can concentrate on the logical operations of the computation, rather
than the details of how it is executed.
Primitive Operations
The core data structure in Crunch is PCollection<S>, an immutable, unordered,
distributed collection of elements of type S. In this section, we examine the primitive
operations on PCollection and its derived types, PTable and PGroupedTable.
union()
The simplest primitive Crunch operation is union(), which returns a PCollection
that contains all the elements of the PCollection it is invoked on and the
PCollection supplied as an argument. For example:
PCollection<Integer> a = MemPipeline.collectionOf(1, 3);
PCollection<Integer> b = MemPipeline.collectionOf(2);
PCollection<Integer> c = a.union(b);
assertEquals("{2,1,3}", dump(c));
MemPipeline's collectionOf() method is used to create a PCollection instance
from a small number of elements, normally for the purposes of testing or demonstration.
The dump() method is a utility method introduced here for rendering the contents of
a small PCollection as a string (it's not a part of Crunch, but you can find the
implementation in the PCollections class in the example code that accompanies this topic).
Since PCollections are unordered, the order of the elements in c is undefined.
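Because the element order is undefined, a test that compares against one fixed rendering can be brittle. The following is a minimal sketch, in plain Java with no Crunch dependency, of an order-independent rendering helper. The class name SortedDump and its dump() method are hypothetical and are not the dump() utility from the accompanying example code. It assumes the elements are available as an Iterable (for a PCollection, presumably the Iterable returned by materialize()) and that they are Comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of Crunch or of the
// example code that accompanies the text.
final class SortedDump {

    // Renders the elements as "{a,b,c}" in natural sorted order, so the
    // result does not depend on the iteration order of the input.
    static <T extends Comparable<? super T>> String dump(Iterable<T> elements) {
        List<T> sorted = new ArrayList<>();
        for (T element : elements) {
            sorted.add(element);
        }
        Collections.sort(sorted);

        StringBuilder out = new StringBuilder("{");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) {
                out.append(',');
            }
            out.append(sorted.get(i));
        }
        return out.append('}').toString();
    }

    public static void main(String[] args) {
        // Two different iteration orders of the same elements render identically.
        System.out.println(dump(Arrays.asList(2, 1, 3))); // prints {1,2,3}
        System.out.println(dump(Arrays.asList(3, 2, 1))); // prints {1,2,3}
    }
}
```

Sorting before rendering trades a small amount of extra work for assertions that stay stable whatever order the elements happen to arrive in. This is only sensible for the small collections used in tests and demonstrations.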
When forming the union of two PCollections, they must have been created from the
same pipeline (or the operation will fail at runtime), and they must have the same type. The
latter condition is enforced at compile time, since PCollection is a parameterized type
and the type arguments for the PCollections in the union must match.
parallelDo()
The second primitive operation is parallelDo(), which calls a function on every element
in an input PCollection<S> and returns a new output PCollection<T> containing
the results of the function calls. In its simplest form, parallelDo() takes two
arguments: a DoFn<S, T> implementation that defines a function transforming elements of