NOTE
There is a trick for obtaining the DOT file when you don't have a PipelineExecution object, such
as when the pipeline is run synchronously or implicitly (see Materialization). Crunch stores the DOT file
representation in the job configuration, so it can be retrieved after the pipeline has finished:
PipelineResult result = pipeline.done();
String dot = pipeline.getConfiguration().get("crunch.planner.dotfile");
Files.write(dot, new File("pipeline.dot"), Charsets.UTF_8);
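The snippet above relies on Guava's Files and Charsets classes. As a minimal sketch, assuming only the JDK is available, the same write can be done with java.nio.file. The class name DotFileWriter and the method writeDot are hypothetical, and the DOT string in main is a stand-in for the value read from the job configuration. The null check reflects that Hadoop's Configuration.get returns null when a property is not set.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DotFileWriter {

    // Hypothetical helper: writes the DOT text to target as UTF-8.
    // Returns false (and writes nothing) if the plan text is missing.
    static boolean writeDot(String dot, Path target) {
        if (dot == null || dot.isEmpty()) {
            return false;
        }
        try {
            Files.write(target, dot.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for pipeline.getConfiguration().get("crunch.planner.dotfile")
        String dot = "digraph G { input -> output; }";
        Path target = Files.createTempFile("pipeline", ".dot");
        if (writeDot(dot, target)) {
            System.out.println("Wrote plan to " + target);
        }
    }
}
```

The resulting file can then be rendered with any Graphviz-compatible tool.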
Let's look at a plan for a nontrivial pipeline for calculating a histogram of word counts for
text files stored in inputPath (see Example 18-3). Production pipelines can grow to be
much longer than this one, with dozens of MapReduce jobs, but this illustrates some of
the characteristics of the Crunch planner.
Example 18-3. A Crunch pipeline for calculating a histogram of word counts
PCollection<String> lines = pipeline.readTextFile(inputPath);
PCollection<String> lower = lines.parallelDo("lower", new ToLowerFn(),
    strings());
PTable<String, Long> counts = lower.count();
PTable<Long, String> inverseCounts = counts.parallelDo("inverse",
    new InversePairFn<String, Long>(), tableOf(longs(), strings()));
PTable<Long, Integer> hist = inverseCounts
    .groupByKey()
    .mapValues("count values", new CountValuesFn<String>(), ints());
hist.write(To.textFile(outputPath), Target.WriteMode.OVERWRITE);
pipeline.done();
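To make the data flow in Example 18-3 easier to follow, here is a plain-Java, in-memory sketch of what the pipeline computes. It does not use Crunch at all: the class WordCountHistogram and its histogram method are hypothetical names, and each input line is treated as a single element, mirroring how count() is applied to the lowercased lines.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordCountHistogram {

    // Hypothetical in-memory equivalent of the pipeline stages.
    static Map<Long, Integer> histogram(List<String> lines) {
        // "lower" followed by count(): occurrences of each lowercased element
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(line.toLowerCase(Locale.ROOT), 1L, Long::sum);
        }

        // "inverse", groupByKey(), "count values": for each count,
        // how many distinct elements occurred that many times
        Map<Long, Integer> hist = new TreeMap<>();
        for (long count : counts.values()) {
            hist.merge(count, 1, Integer::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        // "a" occurs twice (case-insensitively); "b" and "c" once each
        System.out.println(histogram(List.of("a", "A", "b", "c")));
        // prints {1=2, 2=1}
    }
}
```

In the output, the key is a count and the value is the number of distinct elements with that count, which matches the PTable<Long, Integer> type of hist.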
The plan diagram generated from this pipeline is shown in Figure 18-1.
Sources and targets are rendered as folder icons. The top of the diagram shows the input
source, and the output target is shown at the bottom. We can see that there are two
MapReduce jobs (labeled Crunch Job 1 and Crunch Job 2), and a temporary sequence file
that Crunch generates to hold the output of one job so that the other can read it as
input. The temporary file is deleted when clean() is called at the end of the pipeline
execution.
Crunch Job 2 (which is actually the one that runs first; it was just produced by the planner
second) consists of a map phase and a reduce phase, depicted by labeled boxes in the
diagram. Each map and reduce is decomposed into smaller operations, shown by boxes
labeled with names that correspond to the names of primitive Crunch operations in the