Database Reference
In-Depth Information
Figure 18-1. Plan diagram for a Crunch pipeline for calculating a histogram of word counts
After the lowercasing operation, the next transformation in the program is to produce a
PTable of counts of each unique string, using the built-in convenience method
count() . This method actually performs three primitive Crunch operations: a paral-
lelDo() named Aggregate.count, a groupByKey() operation labeled GBK in the dia-
gram, and a combineValues() operation labeled combine.
Each GBK operation is realized as a MapReduce shuffle step, with the groupByKey()
and combineValues() operations running in the reduce phase. The Aggregate.count
parallelDo() operation runs in the map phase, but notice that it is run in the same
map as the lower operation: the Crunch planner attempts to minimize the number of
MapReduce jobs that it needs to run for a pipeline. In a similar way, the inverse paral-
lelDo() operation is run as a part of the preceding reduce. [ 125 ]
The last transformation is to take the inverted counts PTable and find the frequency of
each count. For example, if the strings that occur three times are apple and orange ,
then the count of 3 has a frequency of 2. This transformation is another GBK operation,
which forces a new MapReduce job (Crunch Job 1), followed by a mapValues() oper-
ation that we named count values. The mapValues() operation is simply a paral-
lelDo() operation that can therefore be run in the reduce.
Notice that the map phase for Crunch Job 1 is omitted from the diagram since no primit-
ive Crunch operations are run in it.
Iterative Algorithms
A common use of PObject s is to check for convergence in an iterative algorithm. The
classic example of a distributed iterative algorithm is the PageRank algorithm for ranking
the relative importance of each of a set of linked pages, such as the World Wide Web. [ 126 ]
The control flow for a Crunch implementation of PageRank looks like this:
PTable < String , PageRankData > scores = readUrls ( pipeline , urlInput );
Float delta = 1.0f ;
while ( delta > 0.01 ) {
scores = pageRank ( scores , 0.5f );
PObject < Float > pDelta = computeDelta ( scores );
delta = pDelta . getValue ();
}
Without going into detail on the operation of the PageRank algorithm itself, we can under-
stand how the higher-level program execution works in Crunch.
Search WWH ::




Custom Search