Database Reference
In-Depth Information
type S to type T , and a PType<T> instance to describe the output type T . ( PType s are
explained in more detail in the section Types .)
The following code snippet shows how to use parallelDo() to apply a string length
function to a PCollection of strings:
PCollection < String > a = MemPipeline . collectionOf ( "cherry" , "apple" ,
"banana" );
PCollection < Integer > b = a . parallelDo ( new DoFn < String , Integer >() {
@Override
public void process ( String input , Emitter < Integer > emitter ) {
emitter . emit ( input . length ());
}
}, ints ());
assertEquals ( "{6,5,6}" , dump ( b ));
In this case, the output PCollection of integers has the same number of elements as
the input, so we could have used the MapFn subclass of DoFn for 1:1 mappings:
PCollection < Integer > b = a . parallelDo ( new MapFn < String , Integer >() {
@Override
public Integer map ( String input ) {
return input . length ();
}
}, ints ());
assertEquals ( "{6,5,6}" , dump ( b ));
One common use of parallelDo() is for filtering out data that is not needed in later
processing steps. Crunch provides a filter() method for this purpose that takes a spe-
cial DoFn called FilterFn . Implementors need only implement the accept() meth-
od to indicate whether an element should be in the output. For example, this code retains
only those strings with an even number of characters:
PCollection < String > b = a . filter ( new FilterFn < String >() {
@Override
public boolean accept ( String input ) {
return input . length () % 2 == 0 ; // even
}
});
assertEquals ( "{cherry,banana}" , dump ( b ));
Notice that there is no PType in the method signature for filter() , since the output
PCollection has the same type as the input.
Search WWH ::




Custom Search