Database Reference
In-Depth Information
NOTE
There is a trick for obtaining the DOT file when you don't have a
PipelineExecution
object, such
as when the pipeline is run synchronously or implicitly (see
Materialization
). Crunch stores the DOT file
representation in the job configuration, so it can be retrieved after the pipeline has finished:
PipelineResult result
=
pipeline
.
done
();
String dot
=
pipeline
.
getConfiguration
().
get
(
"crunch.planner.dotfile"
);
Files
.
write
(
dot
,
new
File
(
"pipeline.dot"
),
Charsets
.
UTF_8
);
Let's look at a plan for a nontrivial pipeline for calculating a histogram of word counts for
much longer than this one, with dozens of MapReduce jobs, but this illustrates some of
the characteristics of the Crunch planner.
Example 18-3. A Crunch pipeline for calculating a histogram of word counts
PCollection
<
String
>
lines
=
pipeline
.
readTextFile
(
inputPath
);
PCollection
<
String
>
lower
=
lines
.
parallelDo
(
"lower"
,
new
ToLowerFn
(),
strings
());
PTable
<
String
,
Long
>
counts
=
lower
.
count
();
PTable
<
Long
,
String
>
inverseCounts
=
counts
.
parallelDo
(
"inverse"
,
new
InversePairFn
<
String
,
Long
>(),
tableOf
(
longs
(),
strings
()));
PTable
<
Long
,
Integer
>
hist
=
inverseCounts
.
groupByKey
()
.
mapValues
(
"count values"
,
new
CountValuesFn
<
String
>(),
ints
());
hist
.
write
(
To
.
textFile
(
outputPath
),
Target
.
WriteMode
.
OVERWRITE
);
pipeline
.
done
();
The plan diagram generated from this pipeline is shown in
Figure 18-1
.
Sources and targets are rendered as folder icons. The top of the diagram shows the input
source, and the output target is shown at the bottom. We can see that there are two
MapReduce jobs (labeled Crunch Job 1 and Crunch Job 2), and a temporary sequence file
that Crunch generates to write the output of one job to so that the other can read it as in-
put. The temporary file is deleted when
clean()
is called at the end of the pipeline exe-
cution.
Crunch Job 2 (which is actually the one that runs first; it was just produced by the planner
second) consists of a map phase and a reduce phase, depicted by labeled boxes in the dia-
gram. Each map and reduce is decomposed into smaller operations, shown by boxes
labeled with names that correspond to the names of primitive Crunch operations in the