Crunch - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

The input is a text file with two URLs per line: a page and an outbound link from that

page. For example, the following file encodes the fact that A has links to B and C , and B

has a link to D :

www.A.com www.B.com

www.A.com www.C.com

www.B.com www.D.com

Going back to the code, the first line reads the input and computes an initial

PageRankData object for each unique page. PageRankData is a simple Java class

with fields for the score, the previous score (this will be used to check for convergence),

and a list of outbound links:

public static class PageRankData {

public float score ;

public float lastScore ;

public List < String > urls ;

// ... methods elided

}

The goal of the algorithm is to compute a score for each page, representing its relative im-

portance. All pages start out equal, so the initial score is set to be 1 for each page, and the

previous score is 0. Creating the list of outbound links is achieved using the Crunch oper-

ations of grouping the input by the first field (page), then aggregating the values (out-

bound links) into a list. [ 127 ]

The iteration is carried out using a regular Java while loop. The scores are updated in

each iteration of the loop by calling the pageRank() method, which encapsulates the

PageRank algorithm as a series of Crunch operations. If the delta between the last set of

scores and the new set of scores is below a small enough value (0.01), then the scores

have converged and the algorithm terminates. The delta is computed by the com-

puteDelta() method, a Crunch aggregation that finds the largest absolute difference in

page score for all the pages in the collection.

So when is the pipeline run? The answer is each time pDelta.getValue() is called.

The first time through the loop, no PCollection s have been materialized yet, so the

jobs for readUrls() , pageRank() , and computeDelta() must be run in order to

compute delta . On subsequent iterations only the jobs to compute the new scores

( pageRank() ) and delta ( computeDelta() ) need be run.

Search WWH ::

Custom Search

Home