Database Reference
In-Depth Information
Figure 18-1. Plan diagram for a Crunch pipeline for calculating a histogram of word counts
After the lowercasing operation, the next transformation in the program is to produce a
PTable
of counts of each unique string, using the built-in convenience method
count()
. This method actually performs three primitive Crunch operations: a
paral-
lelDo()
named Aggregate.count, a
groupByKey()
operation labeled GBK in the dia-
gram, and a
combineValues()
operation labeled combine.
Each GBK operation is realized as a MapReduce shuffle step, with the
groupByKey()
and
combineValues()
operations running in the reduce phase. The Aggregate.count
parallelDo()
operation runs in the map phase, but notice that it is run in the same
map as the lower operation: the Crunch planner attempts to minimize the number of
MapReduce jobs that it needs to run for a pipeline. In a similar way, the inverse
paral-
The last transformation is to take the inverted counts
PTable
and find the frequency of
each count. For example, if the strings that occur three times are
apple
and
orange
,
then the count of 3 has a frequency of 2. This transformation is another GBK operation,
which forces a new MapReduce job (Crunch Job 1), followed by a
mapValues()
oper-
ation that we named count values. The
mapValues()
operation is simply a
paral-
lelDo()
operation that can therefore be run in the reduce.
Notice that the map phase for Crunch Job 1 is omitted from the diagram since no primit-
ive Crunch operations are run in it.
Iterative Algorithms
A common use of
PObject
s is to check for convergence in an iterative algorithm. The
classic example of a distributed iterative algorithm is the PageRank algorithm for ranking
The control flow for a Crunch implementation of PageRank looks like this:
PTable
<
String
,
PageRankData
>
scores
=
readUrls
(
pipeline
,
urlInput
);
Float delta
=
1.0f
;
while
(
delta
>
0.01
) {
scores
=
pageRank
(
scores
,
0.5f
);
PObject
<
Float
>
pDelta
=
computeDelta
(
scores
);
delta
=
pDelta
.
getValue
();
}
Without going into detail on the operation of the PageRank algorithm itself, we can under-
stand how the higher-level program execution works in Crunch.