for example, to test for convergence in an iterative algorithm (see Iterative
Algorithms).
There are a few ways of materializing a PCollection; the most direct is to call
materialize(), which returns an Iterable collection of its values. If the
PCollection has not already been materialized, Crunch must run the pipeline to
ensure that the objects in the PCollection have been computed and stored in a
temporary intermediate file so they can be iterated over. [122]
Consider the following Crunch program for lowercasing lines in a text file:
Pipeline pipeline = new MRPipeline(getClass());
PCollection<String> lines = pipeline.readTextFile(inputPath);
PCollection<String> lower = lines.parallelDo(new ToLowerFn(), strings());
Iterable<String> materialized = lower.materialize();
for (String s : materialized) { // pipeline is run
  System.out.println(s);
}
pipeline.done();
The lines from the text file are transformed using the ToLowerFn function, which is
defined separately so we can use it again later:
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

public class ToLowerFn extends DoFn<String, String> {
  @Override
  public void process(String input, Emitter<String> emitter) {
    // Emit the lowercased form of each input line.
    emitter.emit(input.toLowerCase());
  }
}
The call to materialize() on the variable lower returns an Iterable<String>,
but it is not this method call that causes the pipeline to be run. It is only
once an Iterator is created from the Iterable (implicitly by the for-each loop)
that Crunch runs the pipeline. When the pipeline has completed, the iteration
can proceed over the materialized PCollection, and in this example the
lowercase lines are printed to the console.
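The deferred behavior described above can be imitated in plain Java. The following is a minimal sketch under stated assumptions, not Crunch's implementation: LazyIterable and lowercaseLazily are hypothetical names invented for illustration, and a Supplier stands in for the pipeline run. It shows only the general pattern in which constructing the Iterable does no work and the first call to iterator() triggers the computation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

public class LazyMaterializeSketch {

    /** Hypothetical stand-in for a materialized collection; not a Crunch class. */
    static final class LazyIterable<T> implements Iterable<T> {
        private final Supplier<List<T>> job; // the deferred "pipeline run"
        private List<T> results;             // cached after the first run

        LazyIterable(Supplier<List<T>> job) {
            this.job = job;
        }

        /** Reports whether the deferred work has been executed yet. */
        boolean hasRun() {
            return results != null;
        }

        @Override
        public Iterator<T> iterator() {
            if (results == null) {
                results = job.get(); // work happens here, not at construction
            }
            return results.iterator();
        }
    }

    /** Builds a lazy view that lowercases each line when first iterated. */
    static LazyIterable<String> lowercaseLazily(List<String> lines) {
        return new LazyIterable<>(() -> {
            List<String> out = new ArrayList<>();
            for (String line : lines) {
                out.add(line.toLowerCase(Locale.ROOT));
            }
            return out;
        });
    }

    public static void main(String[] args) {
        LazyIterable<String> lower = lowercaseLazily(Arrays.asList("Hello", "WORLD"));
        System.out.println("run before iteration? " + lower.hasRun());
        for (String s : lower) { // iterator() is called here, triggering the work
            System.out.println(s);
        }
        System.out.println("run after iteration? " + lower.hasRun());
    }
}
```

In this sketch the results are cached in memory after the first iteration, which is a simplification; the text above describes Crunch storing them in a temporary intermediate file instead.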
PTable has a materializeToMap() method, which might be expected to behave in a
similar way to materialize(). However, there are two important differences.
First, since it returns a Map<K, V> rather than an iterator, the whole table is
loaded into memory at once, which should be avoided for large collections.
Second, although a PTable is a multi-map, the Map interface does not support
multiple values for a single key,